ABSTRACT



A non-blocking load buffer is provided for use in a high- 
speed microprocessor and memory system. The non- 
blocking load buffer interfaces a high-speed processor/cache 
bus, which connects a processor and a cache to the
non-blocking load buffer, with a lower speed peripheral bus,
which connects to peripheral devices. The non-blocking 
load buffer allows data to be retrieved from relatively low 
bandwidth peripheral devices directly from programmed I/O
of the processor at the maximum rate of the peripherals so 
that the data may be processed and stored without unnec- 
essarily idling the processor. I/O requests from several 
processors within a multiprocessor may simultaneously be 
buffered so that a plurality of non-blocking loads may be 
processed during the latency period of the device. As a 
result, a continuous maximum throughput from multiple I/O 
devices by the programmed I/O of the processor is achieved 
and the time required for completing tasks and processing 
data may be reduced. Also, a multiple priority non-blocking 
load buffer is provided for serving a multiprocessor running 
real-time processes of varying deadlines by prioritization- 
based scheduling of memory and peripheral accesses. 

27 Claims, 10 Drawing Sheets

SYSTEM FOR PLACING ENTRIES OF AN
OUTSTANDING PROCESSOR REQUEST 
INTO A FREE POOL AFTER THE REQUEST 
IS ACCEPTED BY A CORRESPONDING 
PERIPHERAL DEVICE 

CROSS REFERENCE TO RELATED 
APPLICATION 

This application is related to a co-pending application 
filed on Jun. 7, 1995 having a Ser. No. 08/480,738 (now
pending). 

FIELD OF THE INVENTION 

The present invention relates to interfacing between a 
microprocessor, peripherals and/or memories. More 
particularly, the present invention relates to a data process- 
ing system for interfacing a high-speed data bus, which is 
connected to one or more processors and a cache, to a 
second, possibly lower speed peripheral bus, which is
connected to a plurality of peripherals and memories, by a
non-blocking load buffer so that a continuous maximum
throughput from multiple I/O devices is provided via
programmed I/O (memory-mapped I/O) of the processors and a
plurality of non-blocking loads may be simultaneously per- 
formed. Also, the present invention is directed to a multiple- 
priority non-blocking load buffer which serves a multipro- 
cessor for running real-time processes of varying deadlines 
by priority-based non-FIFO scheduling of memory and 
peripheral accesses. 

BACKGROUND OF THE INVENTION 

Conventionally, microprocessor and memory system 
applications retrieve data from relatively low bandwidth I/O
devices, process the data by a processor and then store the 
processed data to the low bandwidth I/O devices. In typical 
microprocessor memory system applications, a processor 
and a cache are directly coupled by a bus to a plurality of 
peripherals such as I/O devices and memory devices. 
However, the processor may be unnecessarily idled due to 
the latency between the I/O devices and the processor, thus 
causing the processor to stall and, as a result, excessive time 
is required to complete tasks. 

In known processing systems, when an operation is per- 
formed on one of the peripherals such as a memory device, 
the time between performing such an operation and subse- 
quent operations is dependent upon the latency period of the 
memory device. Thereby, the processor will be stalled 
during the entire duration of the memory transaction. One 
solution for improving the processing speed of known
processing systems is to perform additional operations dur- 
ing the time of the latency period as long as there is no 
load-use dependency upon the additional operations. For 
example, if data is loaded into a first register by a first
operation (1), where the first operation (1) corresponds to:

    load    r1←[r2]        (1)

and the first register is added to another operation by a
second operation (2), where the second operation (2)
corresponds to:

    add     r3←r1+r4       (2)

the operation (2) is load-use dependent and the operation (2)
must wait for the latency period before being performed.




FIGS. 1(a) and 1(b) illustrate a load-use dependent
operation where the time for initiating the operation (2) must wait
until the operation (1) is completed. Operation (2) is
dependent upon the short latency period t1 in FIG. 1(a)
corresponding to a fast memory and the long latency period t2 in
FIG. 1(b) corresponds to a slower memory.
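
The load-use dependency described above can be sketched with a very small timing model. This is only an illustration under stated assumptions: a single-issue processor that would otherwise issue one instruction per cycle, and hypothetical latency values standing in for the short and long latency periods. None of the names or numbers below come from the figures.

```python
# Hypothetical latencies (in processor cycles) standing in for the short
# latency of a fast memory and the long latency of a slower memory.
T1_FAST = 2
T2_SLOW = 20


def use_issue_cycle(load_issue: int, latency: int) -> int:
    """Earliest cycle at which an instruction that consumes the loaded
    register can issue, assuming the data arrives `latency` cycles
    after the load issues."""
    return load_issue + latency


def stall_cycles(load_issue: int, latency: int) -> int:
    """Cycles lost because the dependent instruction could not issue in
    the cycle immediately following the load."""
    return max(0, use_issue_cycle(load_issue, latency) - (load_issue + 1))


fast_stall = stall_cycles(0, T1_FAST)
slow_stall = stall_cycles(0, T2_SLOW)
```

Under these assumptions the dependent add loses one cycle behind the fast memory and nineteen behind the slow one, which is the idle time the remainder of the description sets out to hide.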

A non-blocking cache and a non-blocking processor are
known where a load operation is performed and additional
operations other than loads may be subsequently performed
during the latency period as long as the operation is not
dependent upon the initial load operation. FIGS. 2(a) and
2(b) illustrate such operations. In operation (1), the first
register is loaded. Next, operations (1.1) and (1.2) are to be
executed. As long as operations (1.1) and (1.2) are not load
dependent on another load, these operations may be
performed during the latency period as illustrated in FIG.
2(a). However, if operation (1.1) is either load-dependent or
a pending load, operations (1.1) and (1.2) must wait until the
latency period ends before being performed.

Also known is a Stall On Use (Hit Under Miss) operation
for achieving cache miss optimizations as described in "A
200 MFLOP Precision Architecture Processor" at Hot Chips
IV, 1993, William Jaffe et al. In this Hit Under Miss
operation, when one miss is outstanding only certain other
types of instructions may be executed before the system
stalls. For example, during the handling of a load miss,
execution proceeds until the target register is needed as an
operand for another instruction or until another load miss
occurs. However, this system is not capable of handling two
misses being outstanding at the same time. For a store miss,
execution proceeds until a load or sub-word store occurs to
the missing line.
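
The load-miss rule described above can be sketched as a small simulation. It is an assumption-laden illustration, not the cited processor's logic: the `Instr` record and the function name are hypothetical, only load misses are modeled (store misses are omitted), and the outstanding miss is assumed not to return within the window being examined.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instr:
    op: str                       # "load" or "alu"
    dest: Optional[str] = None    # destination register
    srcs: Tuple[str, ...] = ()    # source registers
    misses: bool = False          # True if a load misses in the cache


def issued_before_stall(program: List[Instr]) -> int:
    """Count the instructions issued before the machine stalls, with at
    most one load miss allowed to be outstanding."""
    pending_dest: Optional[str] = None   # target register of the open miss
    issued = 0
    for ins in program:
        if pending_dest is not None:
            if pending_dest in ins.srcs:
                break             # target register is needed as an operand
            if ins.op == "load" and ins.misses:
                break             # a second miss cannot be outstanding
        if ins.op == "load" and ins.misses:
            pending_dest = ins.dest
        issued += 1
    return issued


use_after_miss = [
    Instr("load", "r1", ("r2",), misses=True),
    Instr("alu", "r5", ("r6", "r7")),
    Instr("alu", "r3", ("r1", "r4")),      # uses r1, so the machine stalls
]
second_miss = [
    Instr("load", "r1", ("r2",), misses=True),
    Instr("load", "r6", ("r2",), misses=True),   # second miss stalls
]
```

In this sketch a load that hits may still issue while a miss is outstanding, which is the "hit under miss" behavior; the limit of one outstanding miss is what the non-blocking load buffer described later is meant to remove.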

This Hit Under Miss feature can improve the runtime
performance of general-purpose computing applications.
Examples of programs that benefit from the Hit Under Miss
feature are SPEC benchmarks, SPICE circuit simulators and
gcc C compilers. However, the Hit Under Miss feature does
not sufficiently meet the high I/O bandwidth requirements
for digital signal processing applications such as digital
video, audio and RF processing.

Known microprocessor and memory system applications
use real-time processes which are programs having
deadlines corresponding to times where data processing must be
completed. For example, an audio waveform device driver
process must supply audio samples at regular intervals to the
output buffers of the audio device. When the driver software
is late in delivering data for an audio waveform device
driver, the generated audio may be interrupted by
objectionable noises due to an output buffer underflowing.

Analyzing whether or not a real-time process can
meet its deadlines under all conditions requires
predictability of the worst-case performance of the real-time
processing program. However, the sensitivity of the real-time
processing program to its input data or its environment makes it
impractical in many cases to exhaustively check the
behavior of the process under all conditions. Therefore, the
programmer must rely on some combination of analysis and
empirical tests to verify that the real-time process will
complete in the requisite time. The goals of real-time
processing tend to be incompatible with computing
platforms that have memory or peripheral systems in which the
latency of the transactions is unpredictable because an
analysis of whether the real-time deadlines can be met may
not be possible or worst-case assumptions of memory
performance are required. For example, performance estimates
can be made by assuming that every memory transaction
takes the maximum possible time. However, such an
assumption may be so pessimistic that any useful estimate
for the upper bound on the execution time of a real-time task
cannot be made. Furthermore, even if the estimates are only
slightly pessimistic, overly conservative decisions will be
made for the hardware performance requirements so that a
system results that is more expensive than necessary.

Also, it is especially difficult to reliably predict real-time
processing performance on known multiprocessors because
the memory and peripherals are not typically multi-ported.
Therefore simultaneous access by two or more processors to
the same memory device must be serialized. Even if a device
is capable of handling multiple transactions in parallel, the
bus shared by all of the processors may still serialize the
transactions to some degree.

If memory requests are handled in a FIFO manner by a
known multiprocessor, a memory transaction which arrives
slightly later than another memory transaction may take a
much longer amount of time to complete since the later
arriving memory requests must wait until the earlier memory
request is serviced. Due to this sensitivity, very small
changes in the memory access patterns of a program can
cause large changes in its performance. This situation grows
worse as more processors share the same memory. For
example, if ten processors attempt to access the same remote
memory location simultaneously, the spread in memory
latency among the processors might be 10:1 because as
[...] memory accesses. Very small changes to a program or its
[...] may cause the program to exhibit [...] operation
differences which perturb the timing of the memory
transactions.

Furthermore, types of memory which exhibit locality
effects may exacerbate the above-described situation. For
example, accesses to DRAMs are approximately two times
faster if executed in page mode. To use page mode, a recent
access must have been made to an address in the same
memory segment (page). One of the most common access
patterns is sequential accesses to consecutive locations in
memory. These memory patterns tend to achieve high page
locality, thus achieving high throughput and low latency.
Known programs which attempt to take advantage of the
benefits of page mode may be thwarted when a second
program executing on another processor is allowed to
interpose memory transactions on a different memory page. For
instance, if ten processors, each with its own sequential
memory access pattern, attempt to access the same DRAM
bank simultaneously and each of the accesses is to a different
memory page, the spread and memory latencies between the
fastest and slowest responses might be more than 25:1.

The present invention is directed to allowing a high rate
of transfer to memory and I/O devices for tasks which have
real-time requirements. The present invention is also
directed to allowing the system to buffer I/O requests from
several processors within a multiprocessor at once with a
non-blocking load buffer. Furthermore, the present invention
is directed to extending the basic non-blocking load buffer to
service a data processing system running real-time processes
of varying deadlines by using scheduling of memory and
peripheral accesses which is not strictly FIFO scheduling.

SUMMARY OF THE INVENTION

An object of the present invention is to reduce the effect
of memory latency by overlapping a plurality of
non-blocking loads during the latency period and to achieve a
similar reduction on the effect of latency for non-blocking
stores without requiring the additional temporary storage
memory of a separate store buffer.

Another object of the present invention is to provide a
non-blocking load buffer which buffers I/O requests from
one or more processors within a multiprocessor so that a
plurality of I/O memory transactions, such as loads and
stores, may be simultaneously performed and the high I/O
bandwidth requirements for digital signal processing
applications may be met.

A still further object of the present invention is to provide
a multiple-priority non-blocking load buffer for serving a
multiprocessor running real-time processes of varying
deadlines by using a priority-based method to schedule memory
and peripheral accesses.

The objects of the present invention are fulfilled by
providing a data processing system comprising a first data
bus for transferring data requests at a first speed, a second
bus for transmitting I/O data of a second speed, and a
non-blocking load buffer connected to the first and second
data buses for [...] the data requests and the I/O data so
that a plurality of [...] simultaneously. It is possible for
the speed of the second bus to be slower than the speed of the
first bus. As a result, data may be retrieved from relatively
low bandwidth I/O devices and the retrieved data may be
processed and stored to low bandwidth I/O devices without
idling the system.

In a further embodiment of the present invention, the first
data bus is connected to a processor and the second data bus
is connected to a plurality of peripherals and memories. The
data processing system for this embodiment reduces the effect
of latency of the peripherals and memories so that the
application code, which uses programmed I/O, may meet its
real-time constraint. Furthermore the code may have reduced
instruction scheduling constraints and a near maximum
throughput may be achieved from the I/O devices of varying
speeds.

Another embodiment of the present invention is fulfilled
by providing a non-blocking load buffer comprising a
memory array for temporarily storing data, and a control
block for simultaneously performing a plurality of data loads
and stores. The non-blocking load buffer for this
embodiment allows I/O requests from several processors to be
performed at once.

Further scope of applicability of the present invention will
become apparent from the detailed description given
hereafter. However, it should be understood that the detailed
description and specific examples, while indicating
preferred embodiments of the invention, are given by way of
illustration only, since various changes and modifications
within the spirit and scope of the invention will become
apparent to those skilled in the art from this detailed
description.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will become more fully understood
from the detailed description given hereinbelow and the
accompanying drawings which are given by way of
illustration only, and thus are not limitative of the present
invention, wherein:

FIGS. 1(a) and 1(b) illustrate time dependency in a
conventional load use operation;

FIGS. 2(a) and 2(b) illustrate time dependency in a known
Stall-On-Use operation;



Case 2:05-cv-00505-TJW Document 129 Filed 09/12/2007 Page 15 of 21



5,737,547 



FIG. 2(c) illustrates time dependency in a non-blocking
load buffer for an embodiment of the present invention;

FIG. 3(a) illustrates unpredictable memory latency in a
known multiprocessor system;

FIG. 3(b) illustrates memory latency in a multiprocessor
system having a multiple-priority version of the
non-blocking load buffer;

FIG. 4 illustrates a block diagram for the data processing 
system according to one embodiment of the present inven- 
tion; 

FIG. 5 is a schematic illustration of the non-blocking load 
buffer for an embodiment of the present invention; 

FIG. 6 illustrates an example of the contents for the 
entries in the memory of the data processing system for an 
embodiment of the present invention; 

FIGS. 7(a) and 7(b) illustrate examples of the possible 
states for a non-blocking load through the memory of the 
data processing system for an embodiment of the present 
invention; 

FIGS. 7(c), 7(d), 7(e), 7(f), and 7(g) illustrate the progress
of addresses through queues of the data processing system in 
an embodiment of the present invention; 

FIG. 8 illustrates the circuitry used for the non-blocking 
load buffer in an embodiment of the present invention; 

FIGS. 9(a) and 9(b) illustrate parallel pending queues 
which allow prioritization of data in the data processing 
system for an embodiment of the present invention; 

FIG. 10 illustrates a detailed block diagram of an
embodiment of the non-blocking load buffer; and

FIG. 11 illustrates a flow chart of the control of the 
pending queue in an embodiment of the present invention. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENTS 

The first embodiment of the present invention will be
discussed with reference to FIG. 4. The data processing
system includes a non-blocking load buffer 10, a cache 20,
one or more processors 30_0, 30_1, . . . 30_m, a processor/cache
bus 40 for connecting the non-blocking load buffer 10, the
cache 20 and the processors 30_0-m, a plurality of peripheral
devices 50_0, 50_1, . . . 50_n, and a peripheral bus 60, which
includes an output bus 61 and an input bus 62, for
connecting the non-blocking load buffer 10 with the plurality of
peripheral devices 50_0-n. Some or all of the peripheral
devices 50_0-n may be memories, such as DRAM for
example, which serve as a backing store for the cache 20 or
as memory which the processors 30_0-m access directly
through the non-blocking load buffer 10, thus bypassing the
cache 20.

The processor/cache bus 40, which includes an input bus
41 and an output bus 42, transfers data from the cache 20 to
registers of the processors 30_0-m as loads via the output bus
42 or alternatively, from the registers of the processors 30_0-m
to the cache 20 as stores via the input bus 41. All I/O
operations, which are programmed I/O, and any memory
operations, even those which bypass the cache 20, are
transmitted over the processor/cache bus 40 as well. For
such operations the source or sink of the data is one of the
peripheral devices 50_0-n, such as a DRAM, for example.

The bandwidth of the peripheral bus 60 should be chosen
to meet the peak input or output bandwidth which is to be
expected by the peripheral devices 50_0-n. The bandwidth of
the non-blocking load buffer 10 should therefore be chosen
to at least equal the sum of the bandwidth of the processor/
cache bus 40 and the bandwidth of the peripheral bus 60.
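
The sizing rule in the preceding paragraph reduces to a single sum, worked through below. The bandwidth figures and names are hypothetical and chosen only for illustration; the description gives no numeric values.

```python
def min_buffer_bandwidth(processor_cache_bus_bw: float,
                         peripheral_bus_bw: float) -> float:
    """Lower bound on the bandwidth of the non-blocking load buffer:
    the sum of the bandwidths of the two buses it sits between."""
    return processor_cache_bus_bw + peripheral_bus_bw


# Hypothetical bus bandwidths in MB/s (assumed values, not from the source).
PROCESSOR_CACHE_BUS_BW = 800.0
PERIPHERAL_BUS_BW = 200.0

required_bw = min_buffer_bandwidth(PROCESSOR_CACHE_BUS_BW, PERIPHERAL_BUS_BW)
```

With these assumed figures the buffer would need to sustain at least 1000 MB/s.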



The processors 30_0-m may be designed with
dependency-checking logic so that they will only stall when data
requested by a previously executed load instruction has not
yet arrived and that data is required as a source operand in
a subsequent instruction. FIG. 2(c) illustrates the time
dependency for the non-blocking load buffer 10 in an
embodiment of the present invention. In FIG. 2(c), load
instructions are executed at operations (1), (1.1), (1.2), . . .
(1.n). In this example, operations (1), (1.1), (1.2), . . . (1.n)
correspond to:

    load      r1←[r2]         (1)
    load      [...]           (1.1)
    load      r6←[r2+32]      (1.2)
      .
      .
    function  z1              (1.n)



In this example, operation (2) executes an instruction
dependent on operation (1), operation (2.1) executes an instruction
dependent on operation (1.1), operation (2.2) executes an
instruction dependent on operation (1.2), . . . and operation (2.n)
executes an instruction dependent on operation (1.n). For
instance, operations (2.1), (2.2), . . . (2.n) correspond to:

    add       r3←r1+r4        (2)
    add       [...]           (2.1)
    [...]                     (2.2)
      .
      .
    function  z2              (2.n)



where n may be any positive integer. Operations (1.1), (1.2),
. . . (1.n) occur as fast as possible and the time between
these operations is dependent upon the processor cycle
for each operation. As shown in FIG. 2(c), operation (2)
waits for only a part of the latency period between
operations (1) and (2), since the remainder of that
latency period is overlapped with the operations (1.1),
(1.2), . . . (1.n). Thereafter, operations (2.1) and (2.2)
occur as fast as possible after operation (2) with the time
limitation being the processor cycle. The operations (2.1)
and (2.2) do not wait for the full latency period of
operation (1.1) since that latency period is partially
overlapped with operations (1.1), (1.2), . . . (1.n) and
the latency period for operation (1). For values of n
sufficiently large, the wait before operation (2) may
effectively be reduced to zero and the processor will not
suffer any stalls due to memory or I/O latency. In this
embodiment, there may be more than one outstanding load to
slower memories or peripherals, as shown in FIG. 2(c) for
example, and requests may thereby be pipelined to
provide continuous maximum throughput from the
plurality of peripheral devices 50_0-n from programmed
I/O of the processor.

The non-blocking load buffer 10 may accept both load and
store requests from the processors 30_0-m to the peripheral
devices 50_0-n. The non-blocking load buffer 10 may accept
requests at the maximum rate issued by a processor for a
period of time that is limited by the amount of intrinsic
storage in the non-blocking load buffer 10. After accepting
requests, the non-blocking load buffer 10 then forwards the
requests to the peripheral devices 50_0-n at a rate which the
peripheral devices 50_0-n may process them. However, the
non-blocking load buffer 10 does not have to wait until all
of the requests are accepted.

FIG. 5 illustrates internal architecture for the
non-blocking load buffer in an embodiment of the present
invention. The internal architecture of the non-blocking load
buffer 10 includes a plurality of variable-length pending
queues 114_0, 114_1, . . . 114_n which contain entries that
represent the state of requests made by the processors 30_0-m
to the peripheral devices 50_0-n. The pending queues 114_0-n
correspond to each of the peripheral devices 50_0-n and are
used to hold the address and control information for a load
or a store before being transmitted to the corresponding
peripheral device. If an entry in the pending queues 114_0-n
contains a store transaction, then that entry will also contain
the data to be written to the corresponding peripheral device.
Another component of the non-blocking load buffer 10 is a
variable-length return queue 116 used to buffer the data
returned by any of the peripheral devices 50_0-n in response
to a load until the data can be transmitted to the requesting
processor. In addition, there is a free pool 112 that contains
any unused request entries. Once a load request is sent from
one of the pending queues 114_0-n to its corresponding
peripheral device, the load request is marked as
"outstanding" until a return data value is sent by the peripheral and
enqueued in the return queue 116. At any time, a given
request entry is either outstanding, in one of the pending
queues 114_0-n, in the return queue 116, or is not in use and
is therefore located in the free pool 112.

FIG. 6 illustrates an example of a memory structure 200
for storing the information present in each entry of the
non-blocking load buffer 10. E entries may be stored in the
non-blocking load buffer 10 and each entry holds one of a
plurality of requests 201_1, 201_2, . . . 201_E which can either
be a load or a store to any of the peripheral devices 50_0-n.
Each request 201 includes control information 202, an
address 204 and data 206. However, for a load operation, the
data 206 is empty until data returns from the peripheral
devices 50_0-n. The control information 202 describes
parameters such as whether the transaction is a load or a store, the
size of data transferred, and a coherency flag that indicates
whether all previous transactions must have been satisfied
before the new transaction is permitted to be forwarded to
the addressed peripheral device. The coherency flag
enforces sequential consistency, which is occasionally a
requirement for some I/O or memory operations. The data
stored in the entries 201 need not be transferred among the
queues of the non-blocking load buffer 10. Rather, pointers
to individual entries within the memory structure 200 may
be exchanged between the pending queues 114_0-n, the return
queue 116 and the free pool 112, for example.

FIGS. 7(a) and 7(b) illustrate examples of the operation of
the non-blocking load buffer 10 which will be input to a
peripheral device such as a RAM. Initially, all entries of the
non-blocking load buffer 10 are unused and free. With
reference to FIG. 7(a), when one of the processors 30_0-m
issues a non-blocking load, a free entry is chosen and the
address of the load is stored in this entry. Additionally, space
for the returning data is allocated. This entry now enters the
pending queue, one of the queues 114_0-n, corresponding to
the addressed peripheral device. When the peripheral device
is ready, the pending entry is delivered over the peripheral
bus 60 and becomes an outstanding entry. When the
outstanding entry returns from the peripheral device via the
peripheral bus 60, the accompanying requested load data is
written to the part of the entry that had been allocated for the
returning data and the entry is placed in the return queue 116.
The data is buffered until the one processor is ready to
receive the data over the processor/cache bus 40 (i.e., until
the one processor is not using any required cache or register
file write ports). Finally, the data is read out and returned to
the one processor over the processor/cache bus 40 and the
entry again becomes free. In addition, a peripheral device
may accept multiple requests before returning a response.

FIG. 7(b) illustrates the progress of an entry during a store
operation. The store operation is similar to the load
operation but differs because data does not need to be returned to
the processors 30_0-m and this return step is, therefore,
skipped. Once a store is accepted by the non-blocking load
buffer 10, the store is treated by the issuing processor as if
the data has already been transferred to the target peripheral
device, one of the peripheral devices 50_0-n.

The non-blocking load buffer 10 performs similarly to a
rate-matching FIFO but differs in several ways. For instance,
one difference is that although memory transactions
forwarded by the non-blocking load buffer 10 are issued to any
particular peripheral device 50_0-n in the same order that they
were requested by any of the processors 30_0-m, the
transactions are not necessarily FIFO among different peripheral
devices. This is similar to having a separate rate-matching
FIFO for outputting to each of the peripheral devices 50_0-n,
except that the storage memory for all the FIFOs is shared
and is thereby more efficient. Also, the peripheral devices
50_0-n are not required to return the results of a load request
to the non-blocking load buffer 10 in the same order that the
requests were transmitted.

The non-blocking load buffer 10 is capable of keeping all
of the peripheral devices 50_0-n busy because an
independent pending queue exists for each of the peripheral devices.
A separate flow control signal from each of the peripheral
devices indicates that this corresponding peripheral device is
ready. Accordingly, a request for a fast peripheral device
does not wait behind a request for a slow peripheral device.

FIGS. 7(c), 7(d), 7(e), 7(f) and 7(g) illustrate examples of
a non-blocking load progressing through the queues. After
the queues are reset, entries 1, 2, 3 and 4 are in the free pool
as shown in FIG. 7(c). A non-blocking load is pending after
the entry 1 is written into one of the pending queues as
shown in FIG. 7(d). Next, FIG. 7(e) illustrates that entry 1
becomes outstanding after the pending load is accepted by
its corresponding peripheral device. Subsequently, the load
is returned from the peripheral device and entry 1 is now in
the return queue as illustrated in FIG. 7(f), and awaits
transmission to the requesting processor (or to the cache if
the access was part of a cache fill request). Finally, in FIG.
7(g), entry 1 is released to the free pool and, once again, all
entries are free.

FIG. 8 illustrates the circuitry used for the non-blocking
load buffer in an embodiment of the present invention which
implements variable depth FIFOs. This circuitry
manipulates pointers (i.e. RAM addresses for example) to entries in
a RAM 130 instead of storing the addresses of only the head
and tail of a contiguous list of entries in the RAM 130.
Furthermore, the circuitry does not require the pointers to be
sequential. Therefore, records need not be contiguous and
the RAM 130 can contain several interleaved data structures
allowing more flexible allocation and de-allocation of
entries within a FIFO so that it is possible to use the same
storage RAM 130 for multiple FIFOs. The entire circuit in
FIG. 8 represents a large FIFO of an unspecified data width
which is controlled by a smaller FIFO represented by block
170 having pointer widths smaller than the records. In FIG.
8, a first control circuit 110, a second control circuit 120, the
RAM 130 having separate read and write ports, a plurality
of shift registers 150_1, 150_2, 150_3, . . . 150_p corresponding
to each entry in the RAM 130 and a multiplexer 160 are
illustrated. The control circuits 110 and 120 provide pointers
to the entries of the RAM 130 rather than the entries
themselves being queued. Block 170 is essentially a FIFO of
pointers. A pointer may be enqueued in the FIFO of pointers
in block 170 by asserting the "push" control line active and
placing the pointer (RAM 130 address) on the "push-addr"
input lines to block 170. Similarly, a pointer value may be
dequeued by asserting the "pop" control line on block 170
with the pointer value appearing on the "pop-addr" bus.
Block 170 has a "full" output that when active indicates that
no more pointers may be enqueued because all the registers
150_1-p are occupied and the queue is at its maximum depth,
p. Block 170 also has an "empty" output that indicates when
no pointers remain enqueued.

In a preferred embodiment the RAM 130 is divided into
two arrays, an address control array AC and a data array D.
The address control array AC stores the control information
202 and the address 204 for each entry and the data array D
stores the data 206 for each entry. Dividing the non-blocking
load buffer into these two arrays AC and D realizes a more
efficient chip layout than using a single array for both
purposes.

FIG. 10 illustrates the entire structure for a non-blocking
load buffer in an embodiment of the present invention. Each
of the pending queues 114_0-n and the return queue 116 may
be composed of copies of block 170 illustrated in FIG. 8 for
example. The queue entries are stored in the AC and D
arrays 501 and 502, with pointers to the array locations in the
flip-flops that compose the pending queues 114_0-n, the return
queue 116 and the free pool 112. For the embodiment
illustrated in FIG. 10, the write ports to the AC and D arrays
are time-division multiplexed using multiplexers 301 and

by processor B to the same memory device. The first
memory transaction issued by processor B, labeled B1, will
be executed immediately. However, the latency of the
second transaction, labeled B2, by processor B, t_B2, will be
considerably longer than the latency of the first transaction,
t_B1, since the second transaction B2 will not be handled by
the memory system until all of the requests A1, A2, and A3
by processor A have been processed.

One consideration for overcoming all of these
unpredictable memory latency effects is to have the programmer
carefully design the times at which each real-time process
attempts to access the memory, which effectively has
software serialize access to the memory banks instead of
hardware. For instance, two processors may alternate memory
accesses at regular intervals to match the throughput
capabilities of the memory system. However, such programming
places a heavy burden on both the program and the
programmer because the programmer must be aware of the
detailed temporal behavior of the memory access patterns by
the program. The behavior of the programs is usually very
complex and may change based on its input data, which is
not always known. Therefore, it is impractical for the
programmer to know the detailed temporal behavior of the
memory access patterns by the program.

Also, because the combinations of processors that are
making the requests are even more difficult to predict and the
combinations of programs running on the different
processors might not always be the same on each occasion or even
known, the detailed temporal behavior of the memory
access patterns by the program may not be known. For
example, two processes that run two different programs on
two different processors may have been written by different
programmers and the source code may not even be available
for either or both programs. As a result this programming
approach for overcoming the unpredictable memory latency
defeats the purpose of using a non-blocking load buffer,
which was designed to simplify the burden on the
programmer and the compiler by relaxing the scheduling constraints
under which memory operations can be issued while still
permitting efficient utilization of the processor.

FIG. 9(a) illustrates another embodiment of the present
invention for a multiple-priority version of the non-blocking
load buffer. In this embodiment the non-blocking load
buffer is a variation of the basic non-blocking load buffer
where multiple pending sub-queues 214_0, 214_1, . . . 214_p
exist as a component of each pending queue 114_0-n. The
302 to allow multiple concurrent accesses from the proces- outputs of the sub-queues 214^ are then input to the 
sors 30 0 Jn , which issue load and store requests over bus 41. sub-multiplexers 218 associated with one of the peripheral 
and the peripheral devices 50 o which return requested devices. Each output of the sub-multiplexers 218 are then 
load data over bus 62. Similarly, die read ports are time- so input to the main multiplexer 118. Each of these pending 
division multiplexed by multiplexer 303 between returning sub-queues is assigned a unique priority level. In the sim- 
load data going back to the processors 30 o ^ via bus 42 and plest implementation of this multiple-priority version of the 
issuing of load and store requests to the peripherals M via non-blocking load buffer, illustrated in FIG. 9{b). mere are 
bus 61. A processor request control unit 510 accepts the load two pending sub-queues 21^> and 214, for a peripheral 
and store requests from the processors 300 ^ a return queue 55 device with sub-queue 21 4q being assigned a high priority 
control unit 520 controls the data returned by the peripheral and sub-queue 214! assigned a low priority. The multiple 
devices 50 o „ to the return queue 116 and a pending queue priority non-blocking load buffer issues memory or periph- 
oontrol unit 530 controls the selection of the pending queue eral transactions in a highest -priority-first manner. In other 
U4o n . words, no transaction will be issued from a pending sub- 
When either a non-blocking load buffer or a conventional 60 queue for a given peripheral unless all of the higher priority 
single FIFO load buffering operation is used, the uncertainty sub-queues for that same peripheral are empty. For each 
and memory latency is further compounded because indi- peripheral device, requests are issued from the pending 
vidua! processors may transmit more than one memory sub-queue in a FIFO manner for memory transactions of the 
request at a nine as illustrated in FIG. 3(a) for example. In same priority. 

FIG. 3(a). processor A issues three non-blocking loads to a 65 The computational processes associated with the memory 

memory device after a first non-blocking load is issued by transactions mat take place at each of these relative priority 

processor B but before a second non-blocking load is issued levels execute on the processors 30 O-Dt . The priority levels 
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arc assigned by the processors based on the scheduling Process priority may also be used to arbitrate for the 
parameters of all the processes in the system The priority limited resources within the non-blocking load buffer itself, 
level of the process is designed to reflect the relative For example, in one enibodiment of the non-blocking load 
importance of the memory accesses with respect to allowing buffer, limited data storage memory is shared among all 
all of the processes to meet their deadlines if possible. The 5 peripheral devices and processors. When the total number of 
priority levels may also be applicable in a uni-processor slots among all of the non-blocking load buffer queues 
environment. For example, interrupt handlers may be pro- exceeds the number of non-blocking memory entries, it is 
vided to achieve low latency by using higher priority possible for a process at a low priority to prevent a higher 
memory transactions than the transaction being interrupted. priority process from completely utilizing one or more of its 
The priority level may be identified by adding a priority tag to pending sub-queues by using up these entries. The use of 
to the memory transaction. This priority tag is used to priority in allocating non-blocking memory entries can be 
channel the memory transaction into the pending sub-queue used to eliminate mis effect For example, the maximum 
with the matching priority level, thus the selection of the number of outstanding non-blocking memory transactions 
appropriate destination pending sub-queue for a non- may be specified for each of the available priorities, 
blocking memory transaction is a function of both the 15 Once an appropriate priority has been determined for a 
address and the priority level of the access. The priority tag process, the priority of its memory transactions might be 
may be stored with the other control information 202 in the specified to the non-blocking load buffer by employing any 
queue entry 201 corresponding to a given non-blocking of several techniques. For example, the priority of any given 
memory access. memory transaction might be determined by an operand to 
FIG. 11 illustrates a flow chart for the functions performed 20 the instruction that performs the memory access or the 
by the pending queue control unit in an erribodiment of the priority can be associated with certain bit combinations in 
present invention for the multiple-priority version of the either the virtual address, the physical address of the 
non-blocking load buffer. At step S10. the counter i for a memory or the physical address of the peripheral device 
pending queue is initialized to zero. At step S20, the counter accessed. Alternatively, the processors might be designed 
is compared to the number n of pending queues 114. If the 25 with programmable registers that set the priority of all 
counter is not equal to the total number of pending queues. memory accesses for each processor. Yet another possible 
the counter is incremented at step S30 and a determination technique is to store the memory access priority in the page 
is made at step S40 of whether the peripheral corresponding table of a virtual memory system and thus make the memory 
to the counter is ready. If the peripheral device is determined access priority a characteristic of specific memory pages, 
not to be ready at step S40. the process returns to step S20. 30 When these various techniques are combined with conven- 
However. if the peripheral corresponding to the counter is tional memory management and processor design 
ready at step S40. a pending queue is selected at step S50 techniques, memory priorities can be treated as privileged 
with the highest priority that also contains at least one resources. As a result, the operating system reserves the 
memory transaction. At step S60, memory transactions are highest priority levels for its own real-time tasks and there- 
issued in a FIFO order from the selected sub-queue and the 35 fore the user level programs are extremely limited in their 
process returns to step S20. If the counter is equal to the ability to disrupt important system real-time tasks, 
number of pending queues at S20. the process has been To maintain memory coherency in a multiple^riority 
completed for all the pending queues and step S10 is non-blocking load buffer, which is not always necessary, no 
returned to where the counter is initialized. load or store may be issued to any of the peripheral devices 
FIG- 3(fc) illustrates memory latency for the multiple- 40 while a load or store to the same address is outstanding or 
priority non-blocking load buffer. In this example, processor while an earlier load or store to the same address is pending. 
B's memory transactions are assigned a higher priority than The memory coherency may be maintained by time stamps 
processor A's memory transactions. Therefore, transaction combined with address comparison. However, memory 
B2 is delivered to the memory before transactions A2 and A3 coherency may also be maintained more simply by ensuring 
even though the request to begin transaction B2 arrived at 45 that requests of different priorities are sent to non- 
the non-blocking load buffer after requests A2 and A3. As a overlapping address segments within the same peripheral 
result, the latency for transaction B2. t^ is less in FIG. 3(fc) device. 

than 1*2 in FIG. 3(a). which illustrates a non-blocking load In the normal operation of prograrnmed I/O activity, the 

buffer that docs not offer the benefit of mulnple-priority processors 30 o mjm do not need to exactly schedule loads to 

scheduling. Using the multiple-priority version of the non- 50 achieve maximum throughput from the peripheral devices 

blocking load buffer. Processor B spends less time stalled but can instead burst out requests to the limits of the 

waiting for transaction B2 to complete as illustrated by the non-blocking load storage and. if properly programmed, 

comparison in FIGS. 3(a) and 3(fc). perform useful work while waiting for data to return. The 

These priority levels may be heuristic in nature. For non-blocking load buffer allows application code to use 

example, if using earliest deadline first (EDF) scheduling, a 55 progranimed I/O (memory mapped I/O) for achieving near 

process should not be assigned a lower priority than the maximum throughput from the I/O devices of varying 

priority of any process which has a more distant deadline. In speeds which reduces the effective latency of the I/O periph- 

general. this priority level is not necessarily fixed for each eral devices and relaxes the scheduling constraints on the 

process and the priority level may vary over time as the pcograirimed I/O. The non-blocking load buffer functions to 

demands of the real-time process or of other processes 60 rate-match the requests from the processor to the peripheral 

change. As another example of a priority selection devices and back and to act upon the priority of requests 

mechanism, if the load-use distance is known for a load from the processor to allow high priority requests to go 

instruction (as computed by a cornpiler), it can be used to set ahead of low priority requests already buffered, 

a priority for each individual memory access instruction. The non-blocking load buffer uses queues of pointers to 
(Higher load-use distances result in lower priorities.) In 65 centralize storage to increase storage density, parallel queues 
general, non real-time processes (the processes having no to implement requests of different priority and memory 

deadlines) are typically given the lowest priority. segment descriptors to determine priority. Accordingly. I/O 
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requests to a fast device need not wait behind requests to a 
slow device and requests from several processes of a signal 
processor running multiple processes are buffered so that the 
processor is not unnecessarily idled and the time to complete 
tasks is reduced. Because the multiple priority non-blocking 
load buffer has multiple pending sub-queues for each of the 
peripherals, a processor used in combination with this mul- 
tiple priority non-blocking load buffer is able to run real- 
time processes of varying deadlines by use of non-FIFO 
scheduling of memory and peripheral accesses. Also, the 
multiple priority non-blocking load buffer simplifies the 
burden on the programmer and the compiler by relaxing the 
scheduling constraints under which memory operations can 
be issued while still permitting efficient utilization of the 
processor. 
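
By way of illustration only, the pointer-queue organization and the highest-priority-first issue rule described above may be modeled in software. The following Python sketch is a simplified model under stated assumptions and is not the disclosed circuit: the class name MultiPriorityLoadBuffer, its method names, the string tags and the per-priority limit list are all hypothetical, priority 0 is taken to be the highest priority, and a single call to release stands in for the point at which an entry is returned to the free pool.

```python
from collections import deque


class MultiPriorityLoadBuffer:
    """Toy software model: queues hold pointers into one shared entry array."""

    def __init__(self, num_entries, num_peripherals, num_priorities,
                 max_outstanding=None):
        # Shared storage; stands in for the address/control and data arrays.
        self.entries = [None] * num_entries
        # Free pool of pointers (indices) to unused entries.
        self.free_pool = deque(range(num_entries))
        # pending[peripheral][priority] is a FIFO of pointers (0 = highest).
        self.pending = [[deque() for _ in range(num_priorities)]
                        for _ in range(num_peripherals)]
        # Assumed policy: an optional cap on outstanding entries per priority.
        self.max_outstanding = max_outstanding or [num_entries] * num_priorities
        self.outstanding = [0] * num_priorities

    def request(self, peripheral, priority, tag):
        """Queue a request; return its pointer, or None if it cannot be buffered."""
        if not self.free_pool:
            return None
        if self.outstanding[priority] >= self.max_outstanding[priority]:
            return None
        ptr = self.free_pool.popleft()
        self.entries[ptr] = (peripheral, priority, tag)
        self.pending[peripheral][priority].append(ptr)
        self.outstanding[priority] += 1
        return ptr

    def issue(self, peripheral):
        """Pop the oldest request from the highest-priority non-empty sub-queue."""
        for sub_queue in self.pending[peripheral]:
            if sub_queue:
                ptr = sub_queue.popleft()
                return ptr, self.entries[ptr]
        return None

    def release(self, ptr):
        """Return an entry to the free pool once it is no longer needed."""
        _, priority, _ = self.entries[ptr]
        self.entries[ptr] = None
        self.outstanding[priority] -= 1
        self.free_pool.append(ptr)


# Arrival order loosely following FIG. 3(b): processor B has the higher priority.
buf = MultiPriorityLoadBuffer(num_entries=8, num_peripherals=1, num_priorities=2)
for tag, priority in [("B1", 0), ("A1", 1), ("A2", 1), ("A3", 1), ("B2", 0)]:
    buf.request(0, priority, tag)

order = []
item = buf.issue(0)
while item is not None:
    ptr, (_, _, tag) = item
    order.append(tag)
    buf.release(ptr)
    item = buf.issue(0)
print(order)  # ['B1', 'B2', 'A1', 'A2', 'A3']
```

In this model B2 is issued ahead of A1 through A3 although it arrived last, while requests of equal priority keep their FIFO order. Passing a list such as [4, 1] as max_outstanding illustrates how a per-priority limit keeps low priority requests from consuming all of the shared entries.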

The invention being thus described, it will be obvious that 
the same may be varied in many ways. Such variations are 
not to be regarded as a departure from the spirit and scope 
of the invention, and all such modifications as would be 
obvious to one skilled in the art are intended to be included 
within the scope of the following claims. 

What is claimed is: 

1. In a data processing system including one or more 
requesting processors that generate processor requests 
directed to one or more peripheral devices, a non-blocking 
load buffer, comprising: 

a plurality of variable depth pending queues correspond- 
ing to each one of the plurality of peripheral devices for 
queuing entries of processor requests; 

a variable length return queuing unit for queuing data 
returned from said peripheral devices in response to an 
outstanding processor request; and 

a free pool of entries for placing entries of the outstanding 
processor request, after the outstanding processor 
request is accepted by a corresponding peripheral 
device. 

2. A non-blocking load buffer according to claim 1,
wherein said variable length return queuing unit comprises
one variable length return queue.

3. A non-blocking load buffer according to claim 1,
wherein said variable length return queuing unit comprises
a plurality of variable length return queues.

4. A non-blocking load buffer according to claim 3,
wherein each of said plurality of variable length return
queues corresponds to each of said requesting processors.

5. A non-blocking load buffer according to claim 3,
wherein each of said plurality of variable length return
queues corresponds to each of said peripheral devices.

6. A non-blocking load buffer according to claim 3,
wherein each of said plurality of variable length return
queues corresponds to a unique priority level.

7. A non-blocking load buffer according to claim 1,
wherein each of said pending queues include a plurality of
sub-queues, wherein each of said sub-queues is assigned a
unique priority level.

8. A non-blocking load buffer according to claim 7,
wherein said processor requests include an address and a
priority tag, said address directs said processor requests to a
corresponding one of said pending queues and said priority
tag channels said processor requests to a corresponding one
of said sub-queues within the one said pending queue.

9. A non-blocking load buffer according to claim 8,
wherein each of said pending queues comprises a priority
controller for issuing the processor requests from said
sub-queues in a highest priority first manner.

10. A multiple priority non-blocking load buffer comprising:

a variable depth pending queue for queuing entries of 
memory or I/O requests generated by a processor to 
peripheral devices, said pending queue including a 
plurality of sub-queues with each sub-queue having a 
unique priority level assigned thereto; 

a variable length return queuing unit for queuing data
returned from said peripheral devices in response to an
outstanding memory or I/O request; and

a free pool of entries for placing entries of the outstanding
I/O request, after the outstanding I/O request is
accepted.

11. A multiple priority non-blocking load buffer according
to claim 10, wherein said variable length return queuing unit
comprises one variable length return queue.

12. A multiple priority non-blocking load buffer according
to claim 10, wherein said variable length return queuing unit
comprises a plurality of variable length return queues.

13. A multiple priority non-blocking load buffer according
to claim 12, wherein each of said plurality of variable length
return queues corresponds to each of said peripheral devices.

14. A multiple priority non-blocking load buffer according
to claim 12, wherein each of said plurality of variable length
return queues corresponds to a unique priority level.

15. A multiple priority non-blocking load buffer according
to claim 10, wherein said memory or I/O requests include a
priority tag, said priority tag channels said memory or I/O
requests to a corresponding one of said sub-queues.

16. A multiple priority non-blocking load buffer according
to claim 15, wherein said pending queue comprises a priority
controller for issuing said memory or I/O requests from said
sub-queues in a highest priority first manner.

17. A non-blocking load buffer according to claim 7,
wherein a maximum number of outstanding processor
requests is specified for said unique priority level in each of
said sub-queues which prevents entries of processor requests
having low priority levels from using one of said sub-queues
before entries of processor requests having higher priority
levels.

18. A non-blocking load buffer according to claim 1,
wherein priorities corresponding to the entries of processor
requests are determined by logical memory addresses, control
bits derived from a memory management page table,
control bits derived from segmentation entries, virtual
addresses of a memory management system, programmable
registers which set priorities for each processor, instructions,
or instruction operand.

19. A multiple priority non-blocking load buffer according
to claim 10, wherein a maximum number of outstanding
memory or I/O requests is specified for said unique priority
level in each of said sub-queues which prevents entries of
memory or I/O requests having low priority levels from
using one of said sub-queues before entries of memory or
I/O requests having higher priority levels.

20. A multiple priority non-blocking load buffer according
to claim 10, wherein priorities corresponding to entries of
memory or I/O requests are determined by logical memory
addresses, control bits derived from a memory management
page table, control bits derived from segmentation entries,
virtual addresses of a memory management system, programmable
registers which set priorities for each processor,
instructions, or instruction operand.

21. A non-blocking load buffer according to claim 1,
wherein the variable depth pending queues queue the entries
of processor requests using the free pool of entries.

22. The data processing system according to claim 1,
wherein the non-blocking processor requests are generated
by running real-time processes.

23. The data processing system according to claim 1, 
wherein the processor requests are prioritized in the pending 
queues such that the higher priority processor requests are 
processed before the lower priority processor requests. 

24. The data processing system according to claim 23,
wherein the pending queues are prioritized such that the
higher priority pending queues have a higher number of
maximum entries than the lower priority pending queues.

25. The data processing system according to claim 1,
wherein the one or more requesting processors generate the
processor requests over a first bus, and wherein the peripheral
devices accept the processor requests over a second bus
that has a different bus bandwidth from the first bus.

26. The data processing system according to claim 25,
wherein the entries include pointers that point to memory
locations on a shared memory device.

27. The data processing system according to claim 26,
wherein the processor requests include control information,
addresses and data, and wherein the shared memory device
is partitioned for storing the address and control information
in a first memory array and for storing the data in a second
separate memory array.

ASIC



Pronounced ay-sik, and short for Application-Specific Integrated Circuit, a
chip designed for a particular application (as opposed to the integrated
circuits that control functions such as RAM in a PC). ASICs are built by
connecting existing circuit building blocks in new ways. Since the building
blocks already exist in a library, it is much easier to produce a new ASIC
than to design a new chip from scratch.

ASICs are commonly used in automotive computers to control the functions of the

IN THE UNITED STATES DISTRICT COURT
FOR THE EASTERN DISTRICT OF TEXAS 
MARSHALL DIVISION 

MICROUNITY SYSTEMS ENGINEERING, § 
INC., § 
Plaintiff, § 

§ CIVIL ACTION NO. 2-04-CV-120 (TJW) 

§ 

§ 

DELL, INC. and INTEL CORPORATION, § 
Defendants. § 

MEMORANDUM OPINION AND ORDER 

After considering the submissions and the arguments of counsel, the court issues the 
following order concerning the claim construction issues: 
I. Introduction 

In this patent infringement suit, Plaintiff Microunity Systems Engineering, Inc. accuses 
Defendants Dell, Inc. and Intel Corporation of infringing eight United States patents. U.S. Patent 
Nos. 5,742,840 ("the '840 patent"), 5,794,060 ("the '060 patent"), 5,809,321 ("the '321 patent"), 
5,794,061 ("the '061 patent"), 6,584,482 ("the '482 patent"), 6,643,765 ("the '765 patent"), and 
6,725,356 ("the '356 patent") disclose a method and apparatus for the processing of multi-media 
digital communications. Collectively, these seven patents are referred to as the "media processor 
patents." 1 The plaintiff also asserts U.S. Patent No. 5,630,096 ("the '096 patent") against the 
defendants. The '096 patent discloses "a controller for maximizing throughput of memory requests 
from an external device to a synchronous DRAM." 

The invention disclosed in the '840, '060, '061, and '321 patents generally "relates to the
field of communications processing, and more particularly, to a method and apparatus for real-time
processing of multi-media digital communications." '840 patent, col. 1, ll. 15-17. The '765 and
'356 patents relate to general purpose processor architectures, and in particular, wide operand
architectures. '765 patent, col. 1, ll. 19-21. These media processor patents disclose microprocessor
designs that are capable of processing different types of media data, including audio, video, and
graphics data, at very high volume in real-time.

1 The '840, '060, '061, and '321 patents share a common specification. The '765 and
'356 patents also share a common specification.

II. Law Governing Claim Construction 

"A claim in a patent provides the metes and bounds of the right which the patent confers on 
the patentee to exclude others from making, using or selling the protected invention." Burke, Inc. v. 
Bruno Indep. Living Aids, Inc., 183 F.3d 1334, 1340 (Fed. Cir. 1999). Claim construction is an issue 
of law for the court to decide. Markman v. Westview Instruments, Inc., 52 F.3d 967, 970-71 (Fed. 
Cir. 1995) (en banc), aff'd, 517 U.S. 370 (1996). 

To ascertain the meaning of claims, the court looks to three primary sources: the claims, the 
specification, and the prosecution history. Markman, 52 F.3d at 979. Under the patent law, the 
specification must contain a written description of the invention that enables one of ordinary skill 
in the art to make and use the invention. A patent's claims must be read in view of the specification, 
of which they are a part. Id. For claim construction purposes, the description may act as a sort of 
dictionary, which explains the invention and may define terms used in the claims. Id. "One purpose 
for examining the specification is to determine if the patentee has limited the scope of the claims." 
Watts v. XL Sys., Inc., 232 F.3d 877, 882 (Fed. Cir. 2000). 

Nonetheless, it is the function of the claims, not the specification, to set forth the limits of 
the patentee's claims. Otherwise, there would be no need for claims. SRI Int'l v. Matsushita Elec. 
Corp., 775 F.2d 1107, 1121 (Fed. Cir. 1985) (en banc). The patentee is free to be his own 
lexicographer, but any special definition given to a word must be clearly set forth in the 
specification. Intellicall, Inc. v. Phonometrics, 952 F.2d 1384, 1388 (Fed. Cir. 1992). And, although 
the specification may indicate that certain embodiments are preferred, particular embodiments 
appearing in the specification will not be read into the claims when the claim language is broader 
than the embodiments. Electro Med. Sys., S.A. v. Cooper Life Sciences, Inc., 34 F.3d 1048, 1054 
(Fed. Cir. 1994). 

This court's claim construction decision must be informed by the Federal Circuit's recent 
decision in Phillips v. AWH Corporation, 2005 WL 1620331 (Fed. Cir. July 12, 2005) (en banc). In 
Phillips, the court set forth several guideposts that courts should follow when construing claims. In 
particular, the court reiterated that "the claims of a patent define the invention to which the patentee 
is entitled the right to exclude." 2005 WL 1620331, at *4 (emphasis added) (quoting Innova/Pure 
Water Inc. v. Safari Water Filtration Systems, Inc., 381 F.3d 1111, 1115 (Fed. Cir. 2004)). To that 
end, the words used in a claim are generally given their ordinary and customary meaning. Id. at *5. 
The ordinary and customary meaning of a claim term "is the meaning that the term would have to 
a person of ordinary skill in the art in question at the time of the invention, i.e. as of the effective 
filing date of the patent application." Id. This principle of patent law flows naturally from the 
recognition that inventors are usually persons who are skilled in the field of the invention. The 
patent is addressed to and intended to be read by others skilled in the particular art. Id. 

The primacy of claim terms notwithstanding, Phillips made clear that "the person of ordinary 
skill in the art is deemed to read the claim term not only in the context of the particular claim in 
which the disputed term appears, but in the context of the entire patent, including the specification." 
Id. Although the claims themselves may provide guidance as to the meaning of particular terms, 
those terms are part of "a fully integrated written instrument." Id. at **6-7 (quoting Markman, 52 
F.3d at 978). Thus, the Phillips court emphasized the specification as being the primary basis for 
construing the claims. Id. at **7-8. As the Supreme Court stated long ago, "in case of doubt or 
ambiguity it is proper in all cases to refer back to the descriptive portions of the specification to aid 
in solving the doubt or in ascertaining the true intent and meaning of the language employed in the 
claims." Bates v. Coe, 98 U.S. 31, 38 (1878). In addressing the role of the specification, the Phillips 
court quoted with approval its earlier observations from Renishaw PLC v. Marposs Societa' per 
Azioni, 158 F.3d 1243, 1250 (Fed. Cir. 1998): 

Ultimately, the interpretation to be given a term can only be determined and 
confirmed with a full understanding of what the inventors actually invented and 
intended to envelop with the claim. The construction that stays true to the claim 
language and most naturally aligns with the patent's description of the invention will 
be, in the end, the correct construction. 

Consequently, Phillips emphasized the important role the specification plays in the claim 
construction process. 

The prosecution history also continues to play an important role in claim interpretation. The 
prosecution history helps to demonstrate how the inventor and the PTO understood the patent. 
Phillips, 2005 WL 1620331 at *9. Because the file history, however, "represents an ongoing 
negotiation between the PTO and the applicant," it may lack the clarity of the specification and thus 
be less useful in claim construction proceedings. Id. Nevertheless, the prosecution history is 
intrinsic evidence. That evidence is relevant to the determination of how the inventor understood 
the invention and whether the inventor limited the invention during prosecution by narrowing the 
scope of the claims. 

Phillips rejected any claim construction approach that sacrificed the intrinsic record in favor 
of extrinsic evidence, such as dictionary definitions or expert testimony. The en banc court 
condemned the suggestion made by Texas Digital Systems, Inc. v. Telegenix, Inc., 308 F.3d 1193 
(Fed. Cir. 2002), that a court should discern the ordinary meaning of the claim terms (through 
dictionaries or otherwise) before resorting to the specification for certain limited purposes. Id. at 
**13-14. The approach suggested by Texas Digital - the assignment of a limited role to the 
specification - was rejected as inconsistent with decisions holding the specification to be the best 
guide to the meaning of a disputed term. Id. According to Phillips, reliance on dictionary definitions 
at the expense of the specification had the effect of "focus[ing] the inquiry on the abstract meaning 
of words rather than on the meaning of the claim terms within the context of the patent." Id. at *14. 
Phillips emphasized that the patent system is based on the proposition that the claims cover only the 
invented subject matter. Id. What is described in the claims flows from the statutory requirement 
imposed on the patentee to describe and particularly claim what he or she has invented. Id. The 
definitions found in dictionaries, however, often flow from the editors' objective of assembling all 
of the possible definitions for a word. Id. 

Phillips does not preclude all uses of dictionaries in claim construction proceedings. Instead, 
the court assigned dictionaries a role subordinate to the intrinsic record. In doing so, the court 
emphasized that claim construction issues are not resolved by any magic formula. The court did not 
impose any particular sequence of steps for a court to follow when it considers disputed claim 
language. Id. at *16. Rather, Phillips held that a court must attach the appropriate weight to the 
intrinsic sources offered in support of a proposed claim construction, bearing in mind the general rule 
that the claims measure the scope of the patent grant. The court now turns to a description of the 
technology in the case, followed by an assessment of the terms and phrases in dispute. 
III. Discussion 

The parties refer to seven of the patents as the media processor patents. In general, these 
patents describe an invention designed to consolidate the functions of several separate processors 
into a single processor for processing multi-media digital communications. As described in the 
abstract, the inventions relate to a general purpose, programmable media processor for processing 
and transmitting a media data stream of audio, video, radio, graphics, encryption, authentication, and 
networking information in real-time. '840 patent, Abstract. In the patents, the inventors noted 
certain shortcomings of application specific integrated circuits ("ASICs") as a solution to the 
efficient processing of broadband media data. The solution to using a number of different ASICs 
was to develop a single unified media processor. By combining the functionality of the various 
ASICs into a single processor, referred to as a media processor, the patentees sought to avoid the 
various problems associated with the use of a multiplicity of application-specific circuits. 

The media processor incorporates the functionality of the different ASICs into a processor 
that can process the different media data types having different data sizes. The execution and 
arithmetic units can process different data types and sizes, at least in part, because of an ability 
referred to as "dynamic partitioning." The incoming data, which may be of different types, can be 
divided in accordance with the type of data it is. As the input data type changes (e.g. from video to 
audio), the data size may change, requiring a different partitioning. Because the execution and 
arithmetic units can process media data of different types and sizes, there is no time where a 
substantial amount of silicon "real estate" is idle. Moreover, the media processor is very flexible 
because the execution and arithmetic units are not limited to operating on any one particular data 
type, but can accommodate media data of different types and sizes. Finally, to maximize throughput
of the media processor, the execution and arithmetic units can perform operations on a group of data
for each instruction, rather than each operation requiring a separate instruction. 
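
By way of illustration only, the idea of operating on a group of data elements with a single instruction, with the element width chosen per operation, can be sketched as follows. This sketch is not drawn from the patents or the record; the function name `partitioned_add`, the 32-bit word width, and the wrap-around behavior within each element are assumptions made for the example.

```python
def partitioned_add(a: int, b: int, elem_bits: int, word_bits: int = 32) -> int:
    """Add two words element by element; carries never cross element boundaries."""
    if elem_bits <= 0 or word_bits % elem_bits != 0:
        raise ValueError("element width must evenly divide the word width")
    mask = (1 << elem_bits) - 1
    result = 0
    for shift in range(0, word_bits, elem_bits):
        lane_a = (a >> shift) & mask  # extract one element of the first operand
        lane_b = (b >> shift) & mask  # extract the matching element of the second
        # Wrap the sum inside the element (hypothetical choice) and put it back.
        result |= ((lane_a + lane_b) & mask) << shift
    return result


# The same two 32-bit words give different results under different partitionings.
as_bytes = partitioned_add(0x00FF00FF, 0x00010001, elem_bits=8)
as_halfwords = partitioned_add(0x00FF00FF, 0x00010001, elem_bits=16)
```

In this sketch a single call stands in for one group instruction, and changing `elem_bits` from one call to the next stands in for a change of partitioning as the data type changes.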

In addition to the media processor patents, the plaintiff has asserted the '096 patent against
the defendants. The '096 patent describes a memory controller for maximizing throughput of 
memory requests. The controller provides an interface between the memory and other devices, such 
as a processor. The controller enables reading data from, and writing data to, the memory by, for 
example, the processor. The controller can reorder memory requests and/or otherwise sort memory 
requests to make maximum use of the available opportunities for data transfer. 
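
For illustration only, one way a controller might reorder pending requests can be sketched as below. The opinion does not describe the scheduling policy of the '096 patent, so the policy shown (grouping requests that target the same memory row), the `(request_id, row)` tuple layout, and the function names are all assumptions made for the example.

```python
from collections import OrderedDict


def reorder_by_row(requests):
    """Group pending (request_id, row) pairs by row.

    Rows keep the order in which they were first seen, and requests keep
    their arrival order within a row (a hypothetical policy).
    """
    groups = OrderedDict()
    for req_id, row in requests:
        groups.setdefault(row, []).append((req_id, row))
    return [req for group in groups.values() for req in group]


def row_switches(requests):
    """Count how often consecutive requests target different rows."""
    return sum(1 for prev, cur in zip(requests, requests[1:]) if prev[1] != cur[1])


pending = [(0, "A"), (1, "B"), (2, "A"), (3, "B")]
reordered = reorder_by_row(pending)
```

Under these assumptions, the reordered sequence changes rows once instead of three times, which is the kind of saving that leaves more opportunities for data transfer.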

A. Disputed Terms of the Media Processor Patents 

1. A plurality of media data streams 

The first group of disputed claim terms involves the media data streams processed by the 
processor. The plaintiff contends that "a plurality of media data streams" means "two or more 
different types of media data streams such as audio, video, radio, graphics, encryption, 
authentication, and/or network information." The defendants urge that "a plurality of media data 
streams" means "two or more concurrent streams of audio, video, radio, graphics, encryption, 
authentication, and/or network media data information from two or more sources." At issue is the 
defendants' use of the word "concurrent" in their proposed construction. After considering the 
submissions of counsel and the intrinsic record, the court is persuaded that the plaintiff's position
is correct. Accordingly, the court construes "a plurality of media data streams" to mean "two or 
more different types of media data streams." 

2. Unified media data streams 

The court construes "unified media data streams" to mean "combined media data streams of
different types."

3. Unified execution of multiple media data streams

The plaintiff proposes that "unified execution of multiple media data streams" be construed 
to mean "processing two or more different types of media data streams by the same execution unit." 
On the other hand, the defendants submit that the disputed phrase means "operating on two or more 
media data streams in parallel, all at the same time with the same processor, without external 
specialized processors." The parties' primary dispute is whether media data streams are processed 
"in parallel," "all at the same time," and "without external processors." The defendants correctly 
note that the general purpose media processor claimed in the patents was an improvement over the 
combination of prior art specialized processors; however, the court does not read the claim language 
or the specification to exclude all use of external processors. What is required is that the media 
processor has the capability of processing two or more different types of media data streams by the 
same execution unit. The court is persuaded that the plaintiff's proposed construction is correct.
Accordingly, the court construes "unified execution of multiple media data streams" to mean
"processing two or more different types of media data streams by the same execution unit."

4. Capable of dynamic partitioning

All of the asserted claims of the media processor patents recite the phrase "dynamic 
partitioning" or a similar limitation. The plaintiff urges that "capable of dynamic partitioning"
means "able to divide the data into separate and distinct operands on an instruction by instruction 
basis." The defendants contend that the disputed phrase means "capable of dividing width-wise into 
a variable number of elements for simultaneous parallel processing of any combination or 
permutation of media data types in any size." 

The parties' briefs and their hearing presentation focused on whether the court could (or
should) consider an Appendix filed with the PTO in connection with this term. The court has
concluded that resort to the Appendix is unnecessary, and that the specification is sufficiently
illuminating to permit construction of this term. The specification of the '840 patent shows a Table
1. Table 1 is an illustration of an instruction set for the media processor. That table illustrates that data
can be partitioned into byte-sized portions. Table 1 does not show partitioning of data on any basis
other than a byte, or 8-bit level, nor does it show an instruction set capable of operating on data
divided into different partition widths in the data path for any given 32-bits of data. Although Table
1 does not necessarily exclude a processor with the capability to partition data to the extent required
by the defendants' proposed construction, Table 1 is more consistent with the plaintiff's proposed
definition, as it illustrates precisely the type of dynamic partitioning that the plaintiff's construction
permits. After considering the submissions of counsel and the intrinsic record, the court construes 
"capable of dynamic partitioning" to mean "capable of dividing width-wise into a variable number 
of elements." 

5. Capable of dynamic partitioning based on the elemental width of data 
received from the data path, the elemental width being equal to or 
narrower than the data path 

For the same reasons advanced above, the court construes "capable of dynamic partitioning
based on the elemental width of data received from the data path, the elemental width being equal
to or narrower than the data path" to mean "capable of dividing width-wise into a variable number
of elements no wider than the data path, based upon the size of the data elements received from the
data path."

6. Partitioning first and second registers into a plurality of floating point
operands, said floating point operands having a defined bit width, 
wherein said defined bit width is dynamically variable 

The court construes "partitioning first and second registers into a plurality of floating point
operands, said floating point operands having a defined bit width, wherein said defined bit width is
dynamically variable" to mean "dividing a first and a second register width-wise into a variable
number of floating point operands based upon a variable width of the floating point element."

7. Dynamically partition data received from the data path to account for 
an elemental width of the data wherein the elemental width of the data 
is equal to or narrower than the data path 

The court construes "dynamically partition data received from the data path to account for
an elemental width of the data wherein the elemental width of the data is equal to or narrower than
the data path" to mean "dividing width-wise into a variable number of elements no wider than the
data path, based upon the size of the data elements received from the data path."

8. Dynamically partitionable arithmetic unit 

The court construes "dynamically partitionable arithmetic unit" to mean "the arithmetic unit 
can be divided into a variable number of elements." 

9. Unified media processing 

The court now turns to the disputed terms relating to the media processor limitations. The 
plaintiff urges that "unified media processing," as used in the preamble, is not a limitation. Rather, 
the plaintiff contends that this is a statement of intended use. The defendants contend, however, that
"unified media processing" is a limitation, and means "processing a media data stream using parallel
processing and utilizing the entire width of the data path through dynamic partitioning, without
external specialized processors."

After considering the submissions of counsel, the court concludes that "unified media
processing" is not a claim limitation. The phrase "unified media processing" appears in the preamble
of claim 1 of the '321 patent, which provides "[a] system for unified media processing, comprising
. . ." '321 patent, claim 1. The rule in the Federal Circuit is that "[i]n general, a preamble limits the
invention if it recites essential structure or steps, or if it is 'necessary to give life, meaning, and
vitality' to the claim." Catalina Mktg. Int'l v. Coolsavings.com, 289 F.3d 801, 808 (Fed. Cir. 2002).
The phrase "unified media processing" does not appear in the body of the claim, is not necessary
to give life, meaning and vitality to the claim, and does not provide an antecedent basis for the term
"media processor," which does appear in the claim. The court therefore concludes that "unified
media processing" is not a claim limitation. No construction of this phrase is necessary.

10. General purpose media processor/ General purpose programmable
media processor/ Media processor/ Programmable media processor/ 
General purpose [multiple precision parallel operation] programmable 
media processor 

Next, the court turns to the construction of the media processor terms. The plaintiff contends 
that the "general purpose media processor" is "a general purpose processor that also does media
processing." The defendants urge that "general purpose media processor" means "a single 
programmable processor that processes multiple media data streams without external specialized 
processors." There are two disputes between the parties regarding the construction of this term and 
the remaining "media processor" terms. The primary dispute between the parties is whether the 
"media processor" operates without "external specialized processors." A second issue is whether 
the term "programmable media processor" is a claim limitation. 

After considering the submissions and arguments of counsel, the court construes each of
the disputed phrases above to mean "a processor having an execution unit capable of operating on 
different media types and data sizes." The court is not persuaded that the claim language or the 
specification necessarily excludes the presence of other, external processors, from the scope of the 
claim, as long as the processor itself has the requisite capabilities. With respect to the term 
"programmable media processor," the court concludes that this term is not a claim limitation for the 
same reasons provided with respect to the term "unified media processing."

11. Multi-precision arithmetic unit

The parties largely agree on the construction of "multi-precision arithmetic unit" with one
exception. The plaintiff contends that the "multi-precision arithmetic unit" is "a unit that can
perform addition, subtraction, multiplication, division, and other integer and floating point arithmetic
operations on data streams of varying sizes." The defendants object to the plaintiff's proposed
construction on two grounds. First, the defendants contend that the term "unit" refers to a defined
circuit block, and not circuitry distributed across the media processor, as the plaintiff argues.
Second, the defendants further contend that the language "other integer and floating point arithmetic
operations" should not be included in the construction. Thus, the defendants propose the following
construction of "multi-precision arithmetic unit": "a unit that can perform addition, subtraction,
multiplication, division, and other arithmetic operations on data streams of varying sizes." The '840
patent provides as follows:

Many of the logic blocks themselves can also replaced [sic] with a single
multi-precision arithmetic unit, which can be internally partitioned under software control
to perform addition, multiplication, division, and other integer and floating point 
arithmetic operations on symbol streams of varying widths while sustaining the full 
data throughput of the memory hierarchy. 

'840 patent, col. 2, ll. 58-65. Based on the cited portion of the specification, the court is
persuaded that the plaintiff's construction is correct and adopts it. The court declines to further
define "unit" to require a single circuit block. 

12. Multi-precision execution unit 

The court construes "multi-precision execution unit" to mean "a unit that receives 
instructions and executes the instructions to perform simultaneous parallel operations on the plurality 
of media data streams, each of a width up to the width of the data path." 

13. Operable to perform unique operations on each component symbol

The plaintiff proposes that "operable to perform unique operations on each component
symbol" should be construed to mean "capable of performing a distinct operation on each component
of a data unit." The defendants urge that the disputed phrase means "able to simultaneously perform
different operations on each partitioned item of data." At issue is the meaning of the term "unique."
The plaintiff contends it is sufficient that the multi-precision execution unit "performs the same,
single chosen type of operation (e.g., multiply) on each component symbol, albeit in different,
separate, and distinct instances." See Plaintiff's Reply Brief at 33. The defendants insist, however,
that the plaintiff's construction should be rejected because it does not reflect the definition of the
term "unique." After considering the submissions of counsel, the court concludes that the plaintiff
is correct. The court construes "operable to perform unique operations on each
component symbol" to mean "capable of performing a distinct operation on each component of a
data unit."

14. A switch coupled to the data path and programmable to manipulate data 
received from the data path, the switch providing data streams to the 
data path

The plaintiff asserts that this phrase should be construed to mean the following: "a routing
device that is: (1) coupled to and receives data from the data path, (2) rearranges the data fields
received from the data path in different ways in response to instructions by performing operations
such as deals, shuffles, shifts, expands, compresses, swizzles, permutes, and reverses, and (3)
provides the rearranged data fields to the data path." The defendants, however, urge that the disputed
phrase should be construed more broadly to "hardware and/or software that performs data handling
operations on unified media streams." They argue that the plaintiff's proposed construction imports
limitations from the specification by requiring that the switch perform operations such as "deals, 
shuffles, shifts, expands, compresses, swizzles, permutes, and reverses." In the context of these 
patents, the court construes "a switch coupled to the data path and programmable to manipulate data 
received from the data path, the switch providing data streams to the data path" to mean "a routing 
device that is: (1) coupled to and receives data from the data path, (2) rearranges the data fields 
received from the data path in different ways in response to instructions, and (3) provides the 
rearranged data fields to the data path." 

15. Manipulating component fields 

The court construes "manipulating component fields" to mean "rearranging the data fields 
received from the data path in different ways." 

16. Group data handling operations 

The court construes "group data handling operations" to mean "data handling operations 
applied to a group of partitioned fields." 

17. Register controllable cross bar switch 

The plaintiff asserts that "register controllable cross bar switch" means "a routing device that
selectively couples a plurality of outputs to a plurality of inputs under the control of the contents of
a register." The defendants propose that "register controllable cross bar switch" means "a switch
which can independently connect any input to any output, and that is controlled through the use of
hardware storage locations in the media processor that are available to the user/programmer." At
issue is whether the "cross bar switch" must be able to connect any input with any output.

The plaintiff contends that "switch 104," which is described in the specification of the '061
patent, is the cross-bar switch recited in the claims. According to the plaintiff, the specification does 
not require that "switch 104" be able to connect any input with any output. The defendants, on the 
other hand, argue that "cross bar switch" is a term of art that means "a switch that allows any input 
to be connected to any output." The defendants note that the term "cross bar" does not appear in the
specifications of the media processor patents and that there is no indication that the patents use the 
term "cross bar" in a manner different from its ordinary meaning. After considering the submissions 
of counsel, the court construes "register controllable cross bar switch" to mean "a routing device that 
selectively couples a plurality of outputs to a plurality of inputs under the control of the contents of 
a register." 

18. Extended mathematical element 

The plaintiff proposes that "extended mathematical element" be construed to mean "a unit 
that performs additional mathematical operations that are specialized operations for efficient media 
processing." The defendants object to the plaintiff's use of the language "specialized operations for
efficient media processing." They also contend that the specification of the '840 patent provides
that "operations performed by the extended mathematical unit are 'higher level' than those
performed by the ALU . . . ." Defendants' Sur-reply Brief at 13. Thus, the defendants urge that
"extended mathematical element" be construed to mean "a unit that performs higher level
mathematical operations than the arithmetic unit." After considering the submissions of counsel, 
the court construes "extended mathematical element" to mean "a unit that performs additional 
mathematical operations other than addition, subtraction, multiplication, division, and other floating 
point operations." 

19. An extended mathematical element coupled to the data path and 
programmable to implement additional mathematical operations at 
substantially peak data throughput 

The court construes "an extended mathematical element coupled to the data path and
programmable to implement additional mathematical operations at substantially peak data
throughput" to mean "a programmable unit coupled to the data path that performs additional
mathematical operations other than addition, subtraction, multiplication, division, and other floating
point operations at substantially peak data throughput."

20. Table look-up . . . [operation] 

The court construes "table look-up . . . [operation]" to mean "an operation that uses a known 
value to locate an unknown value in a table." 

21. Storing the unified media data streams in a general register file 

The plaintiff submits that "storing the unified media data streams in a general register file" 
means "storing the unified media data streams in a set of registers, which may be addressed by their 
number in the set, in which the registers can be used for various purposes." The defendants assert
that the disputed phrase means "storing the unified media data streams in a set of hardware storage 
locations that are available to the user/programmer for a wide variety of functions." Primarily at 
issue is whether "a general register file" must be available for a "wide variety" of functions. 

The plaintiff contends that the "general register file" need not be available for a wide variety
of functions, but must be available for "different" or "various" purposes. The plaintiff asserts that 
its proposed construction is supported by the specifications of the '060 and '840 patents and that its 
construction would be understood by one ordinarily skilled in the art. According to the plaintiff, the 
patent specifications only require that the registers not be dedicated or specific purpose registers. 
The defendants, on the other hand, contend that the media processor patents use the term "general 
register file" consistent with its ordinary meaning, which requires that the register file be available 
for "a wide variety of functions." After considering the submissions of counsel, the court construes 
"storing the unified media data streams in a general register file" to mean "storing the unified media 
data streams in a set of hardware storage locations that are available to the user/programmer for 
various purposes." 

22. Multiple operands in partitioned fields of operand registers 

The court construes "multiple operands in partitioned fields of operand registers" to mean 
"more than one object upon which operations are performed, each object being stored in a separate 
and distinct field of a register." 

23. Storing partitioned data in registers 

The plaintiff contends that "storing partitioned data in registers" means "storing in registers 
the data that was divided into separate and distinct data fields." The defendants, on the other hand,
argue that "storing partitioned data in registers" means "the results of the dynamic partitioning of
the data stream is [sic] stored in adjacent portions of registers." The plaintiff disagrees with the
defendants' proposed construction because it requires that the results of the dynamic partitioning be
stored in adjacent portions of the register. According to the plaintiff, the specifications of the media
processor patents disclose storing partitioned data in non-adjacent portions. The defendants contend
that a person of ordinary skill in the art would understand that the data are stored in adjacent portions
of registers. After considering the arguments of counsel, the court construes "storing partitioned data
in registers" to mean "the results of the dynamic partitioning of the data stream are stored in
registers."

24. High bandwidth external interface 

The plaintiff argues that "high bandwidth external interface" means "an interface between 
the media processor and external sources of data capable of operating at or near a rate that maintains 
substantially peak operation of the media processor." The defendants urge that "high bandwidth 
external interface" means "an interface between the media processor and external sources of data that 
operates at or near the peak data throughput rate of the execution units [sic] of the media processor." 
The court construes "high bandwidth external interface" to mean "an interface between the media 
processor and external sources of data that is capable of operating at or near the peak data throughput 
rate of the execution unit of the media processor." 

25. A high bandwidth external interface operable to receive a plurality of 
data of various sizes from an external source and communicate the 
received data over the data path at a rate that maintains substantially 
peak operation of the media processor 

After considering the submissions of counsel, the court concludes that this phrase requires 

no additional construction. 

26. High bandwidth interface 

After considering the submissions of counsel, the court concludes that this phrase requires 
no additional construction.

27. Substantially peak rates

The court construes "substantially peak rates" to mean "simultaneous parallel processing 
using all or nearly all of the entire width of the data path." 

28. Data path 

The term "data path" is construed to mean "the buses and circuit elements that convey data." 

29. Bi-directional communication fabric 

The phrase "bi-directional communication fabric" means "an interprocessor communications 
network allowing communication in both directions." The specification suggests that the patentee 
used "network" and "fabric" interchangeably. See '840 patent, col. 6, ll. 16-19. At least one 
programmable media processor is provided within the communications network for receiving, 
processing, and transmitting the at least one stream of unified media data over the bi-directional 
communications fabric. 

30. Being capable of being represented by a defined bit width which is equal 
to said defined bit width of said operands 

The plaintiff proposes that this phrase means "able to be represented by a data field of the
same size as the floating point operands." The defendants contend that the disputed phrase means
"the bit width of the product of the permissible floating point operands must be no greater than the
bit width of each of the operands." The plaintiff objects to the defendants' construction because of
the limitation "no greater than the bit width of each of the operands." The claim language, the
plaintiff argues, requires the capability of being represented by a bit width equal to the bit width of
the operands. The defendants insist, however, that their "no greater than" limitation is entirely
consistent with the claim language and is found in the specification of the '482 patent. After
considering the submissions of counsel, the court adopts the plaintiff's construction and construes
"being capable of being represented by a defined bit width which is equal to said defined bit width
of said operands" accordingly.

31. Group floating point operations 

The court construes "group floating point operations" to mean "floating point operations
applied to a group of partitioned operands." 

32. Dedicated memory 

The term "dedicated memory" appears in all of the claims of the '321 patent. In its reply 
brief, the plaintiff proposes that "dedicated memory" should be construed to mean "memory that is 
within the media processor that is accessible only through memory circuitry associated with the 
media processor." The defendants contend that under the plaintiff's construction, the memory is
not dedicated because it is accessible by circuitry associated with the media processor and associated
with another processor. Thus, the defendants insist that "dedicated memory" means "memory that
is within the media processor that is accessible only through the media processor." Mindful that the
claim language requires a "dedicated" memory, the court adopts the defendants' construction of
"dedicated memory" and construes this term accordingly. 

33. Boolean . . . mathematical operation 

The court construes "Boolean . . . mathematical operation" as follows: "A Boolean operation
is an operation that applies formal logic (for example, 'AND,' 'OR,' 'NOR,' etc.)."

B. Disputed Terms of the '096 Patent

1. Throughput maximizing unit for processing said memory requests to the 
synchronous DRAM in response to scheduling which maximizes the use 
of data slots by the synchronous DRAM 

The court first considers whether § 112, ¶ 6 applies. The plaintiff asserts that it does not and
that the phrase "throughput maximizing unit for processing said memory requests . . ." means "an
element of the controller that processes memory requests in response to scheduling constraints of
the synchronous DRAM which maximizes the use of data slots." In their response brief, the
defendants contend that "maximizing throughput of said memory requests . . ." is a means-plus-
function limitation. The defendants argue that "claims 1 and 3 fail to recite any structure for
processing memory requests to maximize the use of data slots, and claims 11, 13, and 20 fail to recite
any acts prescribing how to maximize the use of data slots." Defendants' Response Brief at 71. The
absence of the word "means" raises a presumption that § 112, ¶ 6 does not apply. The court
concludes that "throughput maximizing unit for processing said memory requests . . ." is not a
means-plus-function limitation. The court therefore adopts the plaintiff's construction and construes
"throughput maximizing unit for processing said memory requests" as an "element of the controller 
that processes memory requests in response to scheduling constraints of the synchronous DRAM 
which maximizes the use of data slots." 

2. Throughput maximizing unit 

The plaintiff proposes that the term "throughput maximizing unit" is "an element of the 
controller that processes memory requests in response to scheduling constraints of the synchronous 
DRAM." The defendants contend that the term "throughput maximizing unit" is a means-plus- 
function limitation. After considering the arguments of counsel, the court concludes that the term
"throughput maximizing unit" is not a means-plus-function limitation. Accordingly, the court
construes "throughput maximizing unit" to mean "an element of the controller that processes
memory requests in response to scheduling constraints of the synchronous DRAM, which maximizes 
throughput." 

3. Maximizing throughput of said memory requests to the synchronous 
DRAM so that use of the data slots by the synchronous DRAM is 
maximized 

The court adopts the plaintiff's construction of "maximizing throughput of said memory
requests . . ." and construes this phrase to mean "scheduling memory requests to the synchronous
DRAM to maximize throughput so that the use of data slots is maximized." 

4. Memory requests 

The term "memory requests" appears in all of the asserted claims of the '096 patent. The 
plaintiff proposes that "memory requests" are "requests from an external device, such as a processor, 
to a memory device." The defendants assert that "memory requests" are "requests from an external 
device such as a processor, to load data from or store data to the synchronous DRAM." The parties 
dispute whether "memory requests" must be directed to the synchronous DRAM. According to the 
defendants, "memory requests" are addressed only to the synchronous DRAM and not any other 
memory device. After considering the submissions of counsel, the court adopts the plaintiff's
construction of "memory requests." 

5. Data slots 

The term "data slots" is recited in all of the asserted claims of the '096 patent. The plaintiff
submits that "data slots" are "times during which data may be transferred to or from the SDRAM." 
The defendants propose that "data slots" are "SDRAM clock cycles for transferring data." The
plaintiff contends that the defendants propose "a highly specific construction of data slot related to
the SDRAM clock cycle" that is not supported by or disclosed in the specification. However, the
defendants argue that the '096 patent makes clear that the "data slots correspond to SDRAM clock
cycles in Figs. 4(a-c) and does not disclose any other time period from which a data slot may be
defined." Defendants' Sur-reply Brief at 20. After considering the submissions of counsel, the court
construes the term "data slots" to mean "SDRAM clock cycles available for transferring data."

6. Interfacing a processing device with a synchronous DRAM

The plaintiff proposes that "interfacing a processing device with a synchronous DRAM" 
means "reading and writing to the synchronous DRAM by a processing device." The defendants 
urge that the disputed phrase means "translating memory requests into synchronous DRAM 
commands." At issue is whether "interfacing" requires translation. In its reply brief, the plaintiff 
argues that the defendants' proposed construction introduces a notion of translation, which is 
inconsistent with the ordinary meaning of the term "interfacing." On the other hand, the defendants 
contend that the disclosed controller receives memory requests and translates them into SDRAM 
commands. The court construes the phrase "interfacing a processing device with a synchronous 
DRAM" to mean "enabling reading and writing to the synchronous DRAM by a processing device."

7. Sorting said memory requests based on their addresses

The plaintiff urges that "sorting said memory requests based on their addresses" means 
"segregating memory requests into one or more groups based on their addresses." The defendants 
argue that the disputed phrase means "segregating memory requests into two or more groups 
according to their addresses." At issue is the number of groups into which "memory requests" can 
be segregated. The court adopts the plaintiff's construction of "sorting said memory requests based
on their addresses" and construes this phrase accordingly.

8. Means for developing memory requests from the processing device 

This means-plus-function limitation appears in claim 10 of the '096 patent. The parties agree
that the court need only construe the corresponding structure. The plaintiff contends that the 
corresponding structure is "bank sort unit 10 and equivalent structures." The defendants urge that 
the corresponding structure is "command update unit 210, bank qualification unit 220, and Tables 
1-2." After considering the submissions of counsel, the court concludes that the corresponding 
structure consists of "bank sort unit 10 and equivalent structures." 

SIGNED this 26th day of August, 2005.

[57] ABSTRACT

A general purpose, programmable media processor for pro- 
cessing and transmitting a media data stream of audio, 
video, radio, graphics, encryption, authentication, and net- 
working information in real-time. The media processor 
incorporates an execution unit that maintains substantially 
peak data throughput of media data streams. The execution
unit includes a dynamically partitionable multi-precision
arithmetic unit, programmable switch and programmable 
extended mathematical element. A high bandwidth external 
interface supplies media data streams at substantially peak 
rates to a general purpose register file and the multi- 
precision execution unit. A memory management unit, and 
instruction and data cache/buffers are also provided. High 
bandwidth memory controllers are linked in series to pro- 
vide a memory channel to the general purpose, program- 
mable media processor. The general purpose, programmable 
media processor is disposed in a network fabric consisting of 
fiber optic cable, coaxial cable and twisted pair wires to 
transmit, process and receive single or unified media data 
streams. Parallel general purpose media processors are dis- 
posed throughout the network in a distributed virtual manner 
to allow for multi-processor operations and sharing of 
resources through the network. A method for receiving, 
processing and transmitting media data streams over the 
communications fabric is also provided.

GENERAL PURPOSE, MULTIPLE
PRECISION PARALLEL OPERATION, 
PROGRAMMABLE MEDIA PROCESSOR 

This is a divisional of application Ser. No. 08/516,036,
filed Aug. 16, 1995, now U.S. Pat. No. 5,742,340.

A Microfiche Appendix consisting of 4 sheets (387 total 
frames) of microfiche is included in this application. The 
Microfiche Appendix contains material which is subject to 
copyright protection. The copyright owner has no objection 
to the facsimile reproduction by any one of the Microfiche 
Appendix, as it appears in the Patent and Trademark Office 
patent files or records, but otherwise reserves all copyright
rights whatsoever. 

FIELD OF THE INVENTION 

This invention relates to the field of communications 
processing, and more particularly, to a method and apparatus
for real-time processing of multi-media digital communica- 
tions. 

BACKGROUND OF THE INVENTION 

Optical fiber and discs have made the transmission and 
storage of digital information both cheaper and easier than 
older analog technologies. An improved system for digital 
processing of media data streams is necessary in order to 
realize the full potential of these advanced media. 

For the past century, telephone service delivered over 
copper twisted pair has been the lingua franca of commu- 
nications. Over the next century, broadband services deliv- 
ered over optical fiber and coax will more completely fulfill 
the human need for sensory information by supplying voice, 
video, and data at rates of about 1,000 times greater than
narrow band telephony. Current general-purpose micropro-
cessors and digital signal processors ("DSPs") can handle
digital voice, data, and images at narrow band rates, but they 
are way too slow for processing media data at broadband 
rates. 

This shortfall in digital processing of broadband media is 
currently being addressed through the design of many dif- 
ferent kinds of application-specific integrated circuits 
("ASICs"). For example, a prototypical broadband device 
such as a cable modem modulates and demodulates digital 
data at rates up to 45 Mbits/sec within a single 6 MHz cable
channel (as compared to rates of 28.8 Kbits/sec within a 6
KHz channel for telephone modems) and transcodes it onto
a 10/100 base T connection to a personal computer ("PC")
or workstation. Current cable modems thus receive data
from a coaxial cable connection through a chain of special-
ized ASIC devices in order to accomplish Quadrature
Amplitude Modulation ("QAM") demodulation, Reed-
Solomon error correction, packet filtering, Data Encryption
Standard ("DES") decryption, and Ethernet protocol han-
dling. The cable modems also transmit data to the coaxial
cable link through a second chain of devices to achieve DES
encryption, Reed-Solomon block encoding, and Quaternary
Phase Shift Keying ("QPSK") modulation. In these
environments, a general-purpose processor is usually
required as well in order to perform initialization, statistics 
collection, diagnostics, and network management functions. 
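
The channel figures quoted above can be checked with a few lines of arithmetic. The following Python fragment is only an illustrative sketch: the helper name bits_per_hz is hypothetical, and the only inputs are the rates and channel widths stated in the text (45 Mbits/sec in a 6 MHz cable channel, 28.8 Kbits/sec in a 6 KHz telephone channel).

```python
def bits_per_hz(bit_rate_bps, channel_hz):
    """Return the bit rate carried per hertz of channel bandwidth."""
    return bit_rate_bps / channel_hz

# Figures taken from the text above.
CABLE_RATE_BPS = 45_000_000      # 45 Mbits/sec
CABLE_CHANNEL_HZ = 6_000_000     # 6 MHz cable channel
PHONE_RATE_BPS = 28_800          # 28.8 Kbits/sec
PHONE_CHANNEL_HZ = 6_000         # 6 KHz telephone channel

cable_density = bits_per_hz(CABLE_RATE_BPS, CABLE_CHANNEL_HZ)   # 7.5 bits/sec/Hz
phone_density = bits_per_hz(PHONE_RATE_BPS, PHONE_CHANNEL_HZ)   # 4.8 bits/sec/Hz
rate_ratio = CABLE_RATE_BPS / PHONE_RATE_BPS                    # 1562.5

print(cable_density, phone_density, rate_ratio)
```

On these figures the cable modem moves roughly 1,500 times more data per second than the telephone modem, which is the same order of magnitude as the "about 1,000 times" broadband-to-narrowband ratio cited in the background.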

The ASIC approach to media processing has three fun-
damental flaws: cost, complexity, and rigidity. The com-
bined silicon area of all the specialized ASIC devices
required in the cable modem, for example, results in a
component cost incompatible with the per subscriber price
target for a cable service. The cable plant itself is a very
hostile service environment, with noise ingress, reflections,
nonlinear amplifiers, and other channel impairments, espe-
cially when viewed in the upstream direction. Telephony
modems have developed an elaborate hierarchy of algo-
rithms implemented in DSP software, with automatic reduc-
tion of data rates from 28.8 Kbits/sec to 19.6 Kbits/sec, 14.4
Kbits/sec, or much lower rates as needed to accommodate
noise, echoes, and other impairments in the copper plant. To
implement similar algorithms on an ASIC-based broadband
modem is far more complex to achieve in software.

These problems of cost, complexity, and rigidity are
compounded further in more complete broadband devices
such as digital set-top boxes, multimedia PCs, or video
conferencing equipment, all of which go beyond the basic
radio frequency ("RF") modem functions to include a broad
range of audio and video compression and decoding
algorithms, along with remote control and graphical user
interfaces. Software for these devices must control what
amounts to a heterogeneous multi-processor, where each
specialized processor has a different, and usually eccentric
or primitive, programming environment. Even if these pro-
gramming environments are mastered, the degree of pro-
grammability is limited. For example, Motion Picture Expert
Group-I ("MPEG I") chips manufactured by AT&T Corpo-
ration will not implement advances such as fractal- and
wavelet-based compression algorithms, and these chips are
not readily software upgradeable to the MPEG-II standard.
A broadband network operator who leases an MPEG ASIC-
based product is therefore at risk of having to continuously
upgrade his system by purchasing significant amounts of
new hardware just to track the evolution of MPEG stan-
dards.

The high cost of ASIC-based media processing results
from inefficiencies in both memory and logic. A typical
ASIC consists of a multiplicity of specialized logic blocks,
each with a small memory dedicated to holding the data
which comprises the working set for that block. The silicon
area of these multiple small memories is further increased by
the overhead of multiple decoders, sense amplifiers, write
drivers, etc. required for each logic block. The logic blocks
are also constrained to operate at frequencies determined by
the internal symbol rates of broadband algorithms in order to
avoid additional buffer memories. These frequencies typi-
cally differ from the optimum speed-area operating point of
a given semiconductor technology. Interconnect and syn-
chronization of the many logic and memory blocks are also
major sources of overhead in the ASIC approach.

The disadvantages of the prior ASIC approach can be over-
come by a single unified media processor. The cost advan-
tages of such a unified processor can be achieved by
gathering all the many ASIC functions of a broadband media
product into a single integrated circuit. Cost reduction is
further increased by reducing the total memory area of such
a circuit by replacing the multiplicity of small ASIC memo-
ries with a single memory hierarchy large enough to accom-
modate the sum total of all the working sets, and wide
enough to supply the aggregate bandwidth needs of all the
logic blocks. Additionally, the logic block interconnect
circuitry to this memory hierarchy may be streamlined by
providing a generally programmable switching fabric. Many
of the logic blocks themselves can also be replaced with a
single multi-precision arithmetic unit which can be inter-
nally partitioned under software control to perform addition,
multiplication, division, and other integer and floating point
arithmetic operations on symbol streams of varying widths,
while sustaining the full data throughput of the memory
hierarchy. The residue of logic blocks that perform opera-
tions that are neither arithmetic nor permutation group ori-
ented can be replaced with an extended math unit that
supports additional arithmetic operations such as finite field,
ring, and table lookup, while also sustaining the full data
throughput of the memory hierarchy.
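
The idea of one wide arithmetic unit that is internally partitioned under software control can be illustrated in software. The Python sketch below is an assumption-laden model, not the circuit described here: the function name partitioned_add, the default 128-bit word size, and the wrap-around behavior on lane overflow are all hypothetical choices made for illustration. It adds two wide words as independent lanes whose width is chosen by the caller, so that carries never cross a lane boundary.

```python
def partitioned_add(a, b, width, total_bits=128):
    """Add two total_bits-wide words as independent lanes of `width` bits.

    Each lane wraps around on overflow; a carry out of one lane is
    discarded rather than propagated into the neighboring lane.
    """
    if width <= 0 or total_bits % width != 0:
        raise ValueError("lane width must evenly divide the word size")
    mask = (1 << width) - 1
    result = 0
    for shift in range(0, total_bits, width):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift
    return result

# The same two 32-bit operands, treated as four 8-bit symbols
# and then as two 16-bit symbols.
x, y = 0x01FF01FF, 0x00010001
print(hex(partitioned_add(x, y, 8, total_bits=32)))   # 0x1000100
print(hex(partitioned_add(x, y, 16, total_bits=32)))  # 0x2000200
```

With 8-bit lanes the 0xFF symbols wrap to 0x00 without disturbing their neighbors, whereas with 16-bit lanes the carry is absorbed inside each 16-bit symbol. The same adder thus serves symbol streams of different widths, which is the property the paragraph above attributes to the multi-precision arithmetic unit.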

The above multi-precision arithmetic, permutation 
switch, and extended math operations can then be organized 
as machine instructions that transfer their operands to and 
from a single wide multi-ported register file. These instruc-
tions can be further supplemented with load/store instruc-
tions that transfer register data to and from a data buffer/
cache static random access memory ("SRAM") and main
memory dynamic random access memories ("DRAMs"),
and with branch instructions that control the flow of instruc-
tions executed from an instruction buffer/cache SRAM.
Extensions to the load/store instructions can be made for
synchronization, and to branch instructions for protected
gateways, so that multiple threads of execution for audio,
video, radio, encryption, networking, etc. can efficiently and
securely share memory and logic resources of a unified 
machine operating near the optimum speed-area point of the 
target semiconductor process. The data path for such a 
unified media processor can interface to a high speed 
input/output ("I/O") subsystem that moves media streams 
across ultra-high bandwidth interfaces to external storage 
and I/O. 

Such a device would incorporate all of the processing 
capabilities of the specialized multi-ASIC combination into 
a single, unified processing device. The unified processor 
would be agile and capable of reprogramming through the 
transmission of new programs over the communication 
medium. This programmable, general purpose device is thus 
less costly than the specialized processor combination, 
easier to operate and reprogram and can be installed or 
applied in many differing devices and situations. The device 
may also be scalable to communications applications that 
support vast numbers of users through massively parallel 
distributed computing. 

It is therefore an object of this invention to process media
data streams by executing operations at very high bandwidth 
rates. 

It is also an object of this invention to unify the audio, 
video, radio, graphics, encryption, authentication, and net- 
working protocols into a single instruction stream.

It is also an object of this invention to achieve high 
bandwidth rates in a unified processor that is easy to 
program and more flexible than a heterogeneous combina- 
tion of special purpose processors. 

It is a further object of the invention to support high level 
mathematical processing in a unified media processor, 
including finite group, finite field, finite ring and table 
look-up operations, all at high bandwidth rates. 

It is yet a further object of the invention to provide a 
unified media processor that can be replicated into a multi- 
processor system to support a vast array of users. 

It is yet another object of this invention to allow for 
massively parallel systems within the switching fabric to 
support very large numbers of subscribers and services. 

It is also an object of the invention to provide a general 
purpose programmable processor that could be employed at 
all points in a network. 

It is a further object of this invention to sustain very high 
bandwidth rates to arbitrarily large memory and input/output 
systems. 

SUMMARY OF THE INVENTION 

In view of the above, there is provided a system for media
processing that maintains substantially peak data throughput
in the execution and transmission of multiple media data
streams. The system includes in one aspect a general
purpose, programmable media processor, and in another
aspect includes a method for receiving, processing and
transmitting media data streams. The general purpose, pro-
grammable media processor of the invention further
includes an execution unit, high bandwidth external
interface, and can be employed in a parallel multi-processor
system.

According to the apparatus of the invention, an execution
unit is provided that maintains substantially peak data
throughput in the unified execution of multiple media data
streams. The execution unit includes a data path, and a
multi-precision arithmetic unit coupled to the data path and
capable of dynamic partitioning based on the elemental
width of data received from the data path. The execution unit
also includes a switch coupled to the data path that is
programmable to manipulate data received from the data
path and provide data streams to the data path. An extended
mathematical element is also provided, which is coupled to
the data path and programmable to implement additional
mathematical operations at substantially peak data through-
put. In a preferred embodiment of the execution unit, at least
one register file is coupled to the data path.

According to another aspect of the invention, a general
purpose programmable media processor is provided having
an instruction path and a data path to digitally process a
plurality of media data streams. The media processor
includes a high bandwidth external interface operable to
receive a plurality of data of various sizes from an external
source and communicate the received data over the data path
at a rate that maintains substantially peak operation of the
media processor. At least one register file is included, which
is configurable to receive and store data from the data path
and to communicate the stored data to the data path. A
multi-precision execution unit is coupled to the data path
and is dynamically configurable to partition data received
from the data path to account for the elemental symbol size
of the plurality of media streams, and is programmable to
operate on the data to generate a unified symbol output to the
data path.

According to the preferred embodiment of the media
processor, means are included for moving data between
registers and memory by performing load and store
operations, and for coordinating the sharing of data among
a plurality of tasks by performing synchronization opera-
tions based upon instructions and data received by the
execution unit. Means are also provided for securely con-
trolling the sequence of execution by performing branch and
gateway operations based upon instructions and data
received by the execution unit. A memory management unit
operable to retrieve data and instructions for timely and
secure communication over the data path and instruction
path respectively is also preferably included in the media
processor. The preferred embodiment also includes a com-
bined instruction cache and buffer that is dynamically allo-
cated between cache space and buffer space to ensure
real-time execution of multiple media instruction streams,
and a combined data cache and buffer that is dynamically
allocated between cache space and buffer space to ensure
real-time response for multiple media data streams.

In another aspect of the invention, a high bandwidth
processor interface for receiving and transmitting a media
stream is provided having a data path operable to transmit
media information at sustained peak rates. The high band-
width processor interface includes a plurality of memory
controllers coupled in series to communicate stored media
information to and from the data path, and a plurality of
memory elements coupled in parallel to each of the plurality
of memory controllers for storing and retrieving the media
information. In the preferred embodiment of the high band-
width processor interface, the plurality of memory control-
lers each comprise a paired link disposed between each
memory controller, where the paired links each transmit and
receive plural bits of data and have differential data inputs
and outputs and a differential clock signal.

Yet another aspect of the invention includes a system for 
unified media processing having a plurality of general 
purpose media processors, where each media processor is 
operable at substantially peak data rates and has a dynami- 
cally partitioned execution unit and a high bandwidth inter- 
face for communicating to memory and input/output ele- 
ments to supply data to the media processor at substantially 
peak rates. A bi-directional communication fabric is 
provided, to which the plurality of media processors are 
coupled, to transmit and receive at least one media stream 
comprising presentation, transmission, and storage media 
information. The bi-directional communication fabric pref- 
erably comprises a fiber optic network, and a subset of the 
plurality of media processors comprise network servers. 

According to yet another aspect of the invention, a 
parallel multi-media processor system is provided having a
data path and a high bandwidth external interface coupled to 
the data path and operable to receive a plurality of data of 
various sizes from an external source and communicate the 
received data at a rate that maintains substantially peak
operation of the parallel multi-processor system. A plurality 
of register files, each having at least one register coupled to 
the data path and operable to store data, are also included. At 
least one multi-precision execution unit is coupled to the 
data path and is dynamically configurable to partition data 
received from the data path to account for the elemental 
symbol size of the plurality of media streams, and is 
programmable to operate in parallel on data stored in the 
plurality of register files to generate a unified symbol output 
for each register file. 

According to the method of the invention, unified streams 
of media data are processed by receiving a stream of unified 
media data including presentation, transmission and storage 
information. The unified stream of media data is dynami- 
cally partitioned into component fields of at least one bit 
based on the elemental symbol size of data received. The 
unified stream of media data is then processed at substan-
tially peak operation. 
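
The partitioning step of the method can be pictured as slicing a wide word into equal component fields. The short Python sketch below is offered only as an illustration under stated assumptions: the function name partition_fields, the 128-bit default word size, and the least-significant-field-first ordering are hypothetical and are not taken from this description.

```python
def partition_fields(word, symbol_bits, total_bits=128):
    """Split a total_bits-wide word into fields of symbol_bits each.

    Fields are returned least significant first. symbol_bits may be as
    small as one bit, matching "component fields of at least one bit".
    """
    if symbol_bits <= 0 or total_bits % symbol_bits != 0:
        raise ValueError("symbol size must evenly divide the word size")
    mask = (1 << symbol_bits) - 1
    return [(word >> shift) & mask
            for shift in range(0, total_bits, symbol_bits)]

# One 16-bit word viewed at three different elemental symbol sizes.
print(partition_fields(0xABCD, 8, total_bits=16))   # [205, 171]
print(partition_fields(0xABCD, 4, total_bits=16))   # [13, 12, 11, 10]
print(len(partition_fields(0xABCD, 1, total_bits=16)))  # 16
```

The same stored bits yield two, four, or sixteen fields depending only on the symbol size supplied, which is the sense in which the partition is dynamic.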

In one aspect of the invention, the unified stream of media 
data is processed by storing the stream of unified media data 
in a general register file. Multi-precision arithmetic opera-
tions can then be performed on the stored stream of unified
media data based on programmed instructions, where the 
multi-precision arithmetic operations include Boolean, inte- 
ger and floating point mathematical operations. The com- 
ponent fields of unified media data can then be manipulated 
based on programmed instructions that implement copying, 
shifting and re-sizing operations. Multi-precision math- 
ematical operations can also be performed on the stored 
stream of unified media data based on programmed 
instructions, where the mathematical operations include
finite group, finite field, finite ring and table look-up opera-
tions. Instruction and data pre-fetching are included to fill
instruction and data pipelines, and memory management 
operations can be performed to retrieve instructions and data 
from external memory. The instructions and data are pref- 
erably stored in instruction and data cache/buffers, in which 
buffer storage in the instruction and data cache/buffers is 
dynamically allocated to ensure real-time execution. 
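
As one concrete instance of a finite field operation of the kind named above, the following Python sketch multiplies two elements of the field GF(2^8), the arithmetic on which byte-oriented Reed-Solomon codes are commonly built. It is a generic textbook shift-and-add routine, not the extended mathematical element of the invention; the function name gf256_mul and the default reduction polynomial 0x11D are assumptions chosen for illustration.

```python
def gf256_mul(a, b, poly=0x11D):
    """Multiply two GF(2^8) elements modulo the given degree-8 polynomial.

    Addition in the field is XOR, so the product is built by XOR-ing
    shifted copies of `a`, reducing whenever bit 8 becomes set.
    """
    result = 0
    while b:
        if b & 1:
            result ^= a      # add the current multiple of a
        a <<= 1              # multiply a by x
        if a & 0x100:
            a ^= poly        # reduce back to 8 bits
        b >>= 1
    return result

print(hex(gf256_mul(2, 0x80)))                    # 0x1d
print(gf256_mul(0x53, 0xCA, poly=0x11B))          # 1
```

In the second call, with the polynomial 0x11B, the elements 0x53 and 0xCA are multiplicative inverses of each other, so the product is 1. A table look-up implementation would typically precompute such products or their logarithms.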

Other aspects of the invention include a method for
achieving high bandwidth communications between a gen-
eral purpose media processor and external devices by pro-
viding a high bandwidth interface disposed between the
media processor and the external devices, in which the high
bandwidth interface comprises at least one uni-directional
channel pair having an input port and an output port. A
plurality of media data streams, comprising component
fields of various sizes, are transmitted and received between
the media processor and the external devices at a rate that
sustains substantially peak data throughput at the media
processor. A method for processing streams of media data is
also included that provides a bi-directional communications
fabric for transmitting and receiving at least one stream of
media data, where the at least one stream of media data
comprises presentation, transmission and storage informa-
tion. At least one programmable media processor is provided
within the communications network for receiving, process-
ing and transmitting the at least one stream of unified media
data over the bi-directional communications fabric.

The general purpose, programmable media processor of
the invention combines in a single device all of the necessary
hardware included in the specialized processor combina-
tions to process and communicate digital media data streams
in real-time. The general purpose, programmable media
processor is therefore cheaper and more flexible than the
prior approach to media processing. The general purpose,
programmable media processor is thus more susceptible to
incorporation within a massively parallel processing net-
work of general purpose media processors that enhance the
ability to provide real-time multi-media communications to
the masses.

These features are accomplished by deploying server
media processors and client media processors throughout the
network. Such a network provides a seamless, global media
super-computer which allows programmers and network
owners to virtualize resources. Rather than restrictively
accessing only the memory space and processing time of a
local resource, the system allows access to resources
throughout the network. In small access points such as
wireless devices, where very little memory and processing
logic is available due to limited battery life, the system is
able to draw upon the resources of a homogeneous multi-
computer system.

The invention also allows network owners the facility to
track standards and to deploy new services by broadcasting
software across the network rather than by instituting costly
hardware upgrades across the whole network. Broadcasting
software across the network can be performed at the end of
an advertisement or other program that is broadcasted
nationally. Thus, services can be advertised and then trans-
mitted to new subscribers at the end of the advertisement.

These and other features and advantages of the invention
will be apparent upon consideration of the following
detailed description of the presently preferred embodiments
of the invention, taken in conjunction with the appended
drawings.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of a broad band media computer
employing the general purpose, programmable media pro-
cessor of the invention;

FIG. 2 is a block diagram of a global media processor
employing multiple general purpose media processors
according to the invention;

FIG. 3 is an illustration of the digital bandwidth spectrum
for telecommunications, media and computing communica-
tions;

FIG. 4 is the digital bandwidth spectrum shown in FIG. 3
taking into account the bandwidth overhead associated with
compressed video techniques;

FIG. 5 is a block diagram of the current specialized
processor solution for mass media communication, where
FIG. 5 shows the current distributed system, and shows a
possible integrated approach;

FIG. 6 is a block diagram of two presently preferred
general purpose media processors, where FIG. 6 shows a
distributed system and shows an integrated media processor;

FIG. 7 is a block diagram of the presently preferred
structure of a general purpose, programmable media pro-
cessor according to the invention;

FIG. 8 is a drawing consisting of visual illustrations of the
various group operations provided on the media processor,
where FIG. 8(a) illustrates the group expand operation, FIG.
8(b) illustrates the group compress or extract operation, FIG.
8(c) illustrates the group deal and shuffle operations, FIG.
8(d) illustrates the group swizzle operation and FIG. 8(e)
illustrates the various group permute operations;

FIG. 9 shows the preferred instruction and data sizes for
the general purpose, programmable media processor, where
FIG. 9(a) is an illustration of the various instruction formats
available on the general purpose, programmable media
processor, FIG. 9(b) illustrates the various floating-point
data sizes available on the general purpose media processor,
and FIG. 9(c) illustrates the various fixed-point data sizes
available on the general purpose media processor;

FIG. 10 is an illustration of a presently preferred memory
management unit included in the general purpose processor
shown in FIG. 7, where FIG. 10(a) is a translation block
diagram and FIG. 10(b) illustrates the functional blocks of
the transaction lookaside buffer;

FIG. 11 is an illustration of a super-string pipeline tech-
nique;

FIG. 12 is an illustration of the presently preferred super-
spring pipeline technique;

FIG. 13 is a block diagram of a single memory channel for
communication to the general purpose media processor
shown in FIG. 7;

FIG. 14 is an illustration of the presently preferred con-
nection of standard memory devices to the preferred
memory interface;

FIG. 15 is a block diagram of the input/output controller
for use with the memory channel shown in FIG. 13;

FIG. 16 is a block diagram showing multiple memory
channels connected to the general purpose media processor
shown in FIG. 7, where FIG. 16(a) shows a two-channel
implementation and FIG. 16(b) illustrates a twelve-channel
embodiment;

FIG. 17 illustrates the presently preferred packet commu-
nications protocol for use over the memory channel shown
in FIG. 13;

FIG. 18 shows a multi-processor configuration employing
the general purpose media processor shown in FIG. 7, where
FIG. 18(a) shows a linear processor configuration, FIG.
18(b) shows a processor ring configuration, and FIG. 18(c)
shows a two-dimensional processor configuration; and

puter 10 is provided in FIG. 1. The broad band microcom-
puter 10 consists essentially of a general purpose media
processor 12. As will be described in more detail below, the
general purpose media processor 12 receives, processes and
transmits media data streams in a bi-directional manner from
upstream network components to downstream devices. In
general, media data streams received from upstream net-
work components can comprise any combination of audio,
video, radio, graphics, encryption, authentication, and net-
working information. As those skilled in the art will
appreciate, however, the general purpose media processor
12 is in no way limited to receiving, processing and trans-
mitting only these types of media information. The general
purpose media processor 12 of the invention is capable of
processing any form of digital media information without
departing from the spirit and essential scope of the inven-
tion.

System Configuration

In the preferred embodiment of the invention shown in
FIG. 1, media data streams are communicated to the media
processor 12 from several sources. Ideally, unified media
data streams are received and transmitted by the general
purpose media processor 12 over a fiber optic cable network
14. As will be described in more detail below, although a
fiber optic cable network is preferred, the presently existing
communications network in the United States consists of a
combination of fiber optic cable, coaxial cable and other
transmission media. Consequently, the general purpose
media processor 12 can also receive and transmit media data
streams over coaxial cable 14 and traditional twisted pair
wire connections 16. The specific communications protocol
employed over the twisted pair 16, whether POTS, ISDN or
ADSL, is not essential; all protocols are supported by the
broad band microcomputer 10. The details of these protocols
are generally known to those skilled in the art and no further
discussion is therefore needed or provided herein.

Another form of upstream network communication is
through a satellite link 18. The satellite link 18 is typically
connected to a satellite receiver 20. The satellite receiver 20
comprises an antenna, usually in the form of a satellite dish,
and amplification circuitry. The details of such satellite
communications are also generally known in the art, and
further detail is therefore not provided or included herein.

As described above, the general purpose media processor
12 communicates in a bi-directional manner to receive,
process and transmit media data streams to and from down-
stream devices. As shown in FIG. 1, downstream commu-
nication preferably takes place in at least two forms. First,
media data streams can be communicated over a
bi-directional local network 22. Various types of local net-
works 22 are generally known in the art and many different
forms exist. The general purpose media processor 12 is
capable of communicating over any of these local networks,
and the particular type of network selected is implementa-
tion specific.

The local network 22 is preferably employed to commu-
nicate between the unified processor 12, and audio/visual

FIG. 19 shows a presently preferred multi-chip imple- 60 devices 24 or other digital devices 26. Presently preferred 

mentation of the general purpose, programmable media examples of audio/visual devices 24 include digital cable 

processor of the invention. television, video-on-demand devices, electronic yellow 

pages services, integrated message systems, video 

DETAILED DESC RIPTION OF THE telephones, video games and electronic program guides. As 

PRESENTLY PREFERRED EMBODIMENTS 65 ta skflled io the an will appreciate, other forms of 

Referring to the drawings, where like-reference numerals audio/video devices are contemplated within the spirit and 

refer 10 like dements throughout, a broad band microcom- scope of the invention. Presently preferred embodiments of 
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other digital devices 26 for communication with the general 
purpose media processor 12 include personal computers, 
television sets, work stations, digital video camera 
recorders, and compact disc read-only memories. As those 
skilled in the art will also appreciate, further digital devices 
26 are contemplated for communication to the general 
purpose media processor 12 without departing from the 
spirit and scope of the invention. 

Second, the general purpose media processor preferably 
also communicates with downstream devices over a wireless 
network 28. In the presently preferred embodiment of the 
invention, wireless devices for communication over the 
wireless network 28 can comprise either remote communi- 
cation devices 30 or remote computing devices 32. Presently 
preferred embodiments of the remote communications 
devices 30 include cordless telephones and personal com- 
municators. Presently preferred embodiments of the remote 
computing devices 32 include remote controls and telecom- 
municating devices. As those skilled in the art will 
appreciate, other forms of remote communication devices 30 
and remote computing devices 32 are capable of communi- 
cation with the general purpose media processor 12 without 
departing from the spirit and scope of the invention. An agile 
digital radio (not shown) that incorporates a general purpose 
media processor 12 may be used to communicate with these 
wireless devices. 

Network Configuration 

Referring now to FIG. 2, the general purpose media
processor 12 is preferably disposed throughout a digital 
communications network 38. In order to enable communi- 
cation among large and small businesses, residential cus- 
tomers and mobile users, the network 38 can consist of a 
combination of many individual sub-networks comprised of 
three main forms of interconnection. The trunk and main 
branches of the network 38 preferably employ fiber optic 
cable 40 as the preferred means of interconnection. Fiber 
optic cable 40 is used to connect between general purpose 
media processors 12 disposed as network servers 46 or large 
business installations 48 that are capable of coupling directly 
to the fiber optic link 40. For communications to small 
business and residential customers that may be incapable of 
directly coupling to the fiber optic cable 40, a general 
purpose media processor 12 can be used as an interface to 
other forms of network interconnection. 

As shown in FIG. 2, alternate forms of interconnection 
consist of coaxial cable lines 42 and twisted pair wiring 44. 
Coaxial cable lines are currently in place throughout the 
U.S. and are typically employed to provide cable television
services to residential homes. According to the preferred 
embodiment of the invention, general purpose media pro- 
cessors 12 can be installed at these residential locations 52. 
In contrast to the specialized processor approach, the general 
purpose media processor 12 provides enough bandwidth to 
allow for bi-directional communications to and from these 
residential locations 52. 

Network servers 46 controlled by general purpose media 
processors 12 are also employed throughout the network 38. 
For example, the network servers 46 can be used to interface 
between the fiber optic network 40 and twisted pair wiring 
44. Twisted pair wiring 44 is still employed for small 
businesses 50 and residential locations 52 that do not or 
cannot currently subscribe to coaxial cable or fiber optic 
network services. General purpose media processors 12 are
also disposed at these small business locations 50 and
non-cable residential locations 52. General purpose media
processors 12 are also installed in wireless or mobile locations
52, which are coupled to the network 38 through agile
digital radios (not shown). As shown in FIG. 2, network
databases or other peripherals 56 can also be coupled to general
purpose media processors 12 in the network 38.

The general purpose media processor 12 is operable at
significantly high bandwidths in order to receive, process
and transmit unified media data streams. Referring to FIG.
3, the respective frequencies for various types of media data
streams are set forth against a bandwidth spectrum 60. The
bandwidth spectrum 60 includes three component
spectrums, all along the same range of frequencies, which
represent the various frequency rates of digital media communications.
Current computing bandwidth capabilities are
also displayed. The telecommunications spectrum 62 shows
the various frequency bands used for telecommunications
transmission. For example, teletype terminals and modems
operate in a range between approximately 64 bits/second and
16 kilobits/second. The ISDN telecommunication protocol
operates at 64 kilobits/second. At the upper end of the
telecommunications spectrum 62, T1 and T3 trunks operate
at one megabit per second and 32 megabits per second,
respectively. The SONET frequency range extends from
approximately 128 megabits per second up to approximately
32 gigabits per second. Accordingly, in order to carry such
broad band communications, the general purpose media
processor 12 is capable of transferring information at rates
into the gigabits per second range or higher.

A spectrum of typical media data streams is presented in
the media spectrum 64 shown in FIG. 3. Voice and music
transmissions are centered at frequencies of approximately
64 kilobits per second and one megabit per second, respectively.
At the upper end of the media spectrum 64, video
transmission takes place in a range from 128 megabits per
second for high density television up to over 256 gigabits per
second for movie applications. When using common video
compression techniques, however, the video transmission
spectrum can be shifted down to between 32 kilobits per
second and 128 megabits per second as a result of the data
compression. As described below, the processing required to
achieve the data compression results in an increase in
bandwidth requirements.

Current computing bandwidths are shown in the computing
spectrum 66 of FIG. 3. Serial communications presently
take place in a range between two kilobits per second up to
512 kilobits per second. The Ethernet network protocol
operates at approximately 8 megabits per second. Current
dynamic random access memory and other digital input/
output peripherals operate between 32 megabits per second
and 512 megabits per second. Presently available microprocessors
are capable of operation in the low gigabits per
second range. For example, the '386 Pentium microprocessor
manufactured by Intel Corporation of Santa Clara, Calif.,
operates in the lower half of that range, and the Alpha
microprocessor manufactured by Digital Equipment Corporation
approaches the 16 gigabits per second range.

When video compression is employed, as expressed
above, the associated processing overhead reduces the effective
bandwidth of the particular processor. As a result, in
order to handle compressed video, these processors must
operate in the terahertz frequency range. The bandwidth
spectrum 60 shown in FIG. 4 represents the effect of
handling media data streams including compressed video.
The computing spectrum 66 is skewed down to properly
align the computing bandwidth requirements with the telecommunications
spectrum 62 and the media spectrum 64.
Accordingly, current processor technology is not sufficient
to handle the transmission and processing associated with
complex streams of multi-media data.

The current specialized processor approach to media
processing is illustrated in the block diagram shown in FIG.
5. As shown in FIG. 5, special purpose processors are
coupled to a back plane 70, which is capable of transmitting
instructions and data at the upper kilobits to lower gigabits
per second range. In a typical configuration, an audio
processor 76, video processor 78, graphics processor 80 and
network processor 82 are all coupled to the back plane 70.
Each of the audio, video, graphics and network processors
76-82 typically employs its own private or dedicated
memory 84, which is only accessible to the specific
processor and not accessible over the back plane 70. As
described above, however, unless video data streams are
constantly being processed, for example, the video processor
78 will sit idle for periods of time. The computing power of
the dedicated video processor 78 is thus only available to
handle video data streams and is not available to handle
other media data streams that are directed to other dedicated
processors. This, of course, is an inefficient use of the video
processor 78, particularly in view of the overall processing
capability of this multi-processor system.

The general purpose media processor 12, in contrast,
handles a data stream of audio, video, graphics and network
information all at the same time with the same processor. In
order to handle the ever changing combination of data types,
the general purpose media processor 12 is dynamically
partitionable to allocate the appropriate amount of processing
for each combination of media in a unified media data
stream. A block diagram of two preferred general purpose
media processor system configurations is shown in FIG. 6.

Referring to FIG. 6, a general purpose media processor 12
is coupled to a high-speed back plane 90. The presently
preferred back plane 90 is capable of operation at 30 gigabits
per second. As those skilled in the art will appreciate, back
planes 90 that are capable of operation at 400 gigabits per
second or greater bandwidth are envisioned within the spirit
and scope of the invention. Multiple memory devices 92 are
also coupled to the back plane 90, which are accessible by
the general purpose media processor 12. Input/output
devices 94 are coupled to the back plane 90 through a
dual-ported memory 92. The configuration of the input/
output devices 94 on one end of the dual-ported memory 92
allows the sharing of these memory devices 92 throughout
a network 38 of general purpose media processors 12.

Alternatively, FIG. 6 shows a presently preferred integrated
general purpose media processor 12. The integrated
processor includes on-board memory and I/O 86. The
on-board memory is preferably of sufficient size to optimize
throughput, and can comprise a cache and/or buffer memory
or the like. The integrated media processor 12 also connects
to external memory 88, which is preferably larger than the
on-board memory 86 and forms the system main memory.

Execution Unit

One presently preferred embodiment of an integrated
general purpose media processor 12 is shown in FIG. 7. The
core of the integrated general purpose media processor 12
comprises an execution unit 100. Three main elements or
subsections are included in the execution unit 100. A multiple
precision arithmetic/logic unit ("ALU") 102 performs
all logical and simple arithmetic operations on incoming
media data streams. Such operations consist of calculate and
control operations such as Boolean functions, as well as
addition, subtraction, multiplication and division. These
operations are performed on single or unified media data
streams transmitted to and from the multiple precision ALU
102 over a data bus or data path 108. Preferably the data path
108 is 128 bits wide, although those skilled in the art will
appreciate that the data path 108 can take on any width or
size without departing from the spirit and scope of the
invention. The wider the data path 108, the more unified
media data can be processed in parallel by the general
purpose media processor 12.

Coupled to the multi-precision ALU 102 via the data path
108, and also an element of the execution unit 100, is a
programmable switch 104. The programmable switch 104
performs data handling operations on single or unified media
data streams transmitted over the data path 108. Examples of
such data handling operations include deals, shuffles, shifts,
expands, compresses, swizzles, permutes and reverses,
although other data handling operations are contemplated.
These operations can be performed on single bits or bit fields
consisting of two or more bits up to the entire width of the
data path 108. Thus, single bits or bit fields of various sizes
can be manipulated through programmable operation of the
switch 104.

Examples of the presently preferred data manipulation
operations performed by the general purpose media processor
12 are shown in FIG. 8. A group expand operation is
visually illustrated in FIG. 8(a). According to the group
expand operation, a sequential field of bits 270 can be
divided into constituent sub-fields 272a-272d for insertion
into a larger field array 274. The reverse of the group expand
operation is a group compress or extract operation. A visual
illustration of the group compress or extract operation is
shown in FIG. 8(b). As shown, separate sub-fields
272a-272d from a larger bit field 274 can be combined to
form a contiguous or sequential field of bits 270.

Referring to FIGS. 8(c)-8(e), group deal, shuffle, swizzle
and permute operations performed by the programmable
switch 104 are also illustrated. The operations performed by
these instructions are readily understood from a review of
the drawings. The group manipulation operations illustrated
in FIGS. 8(a)-8(e) comprise the presently contemplated data
manipulation operations for the general purpose media processor
12. As those skilled in the art will appreciate, either
a subset of these operations or additional data manipulation
operations can be incorporated in other alternate embodiments
of the general purpose media processor 12 without
departing from the spirit and scope of the invention.

Referring again to FIG. 7, higher level mathematical
operations than those performed by the multi-precision ALU
102 are performed in the general purpose media processor
12 through an extended math element 106. The extended
math element 106 is coupled to the data path 108 and also
comprises part of the execution unit 100. The extended math
element 106 performs the complex arithmetic operations
necessary for video data compression and similarly intensive
mathematical operations. One presently preferred example
of an extended math operation comprises a Galois field
operation. Other examples of extended mathematical functions
performed by the extended math element 106 include
CRC generation and checking, Reed-Solomon code generation
and checking, and spread-spectrum encoding and
decoding. As those skilled in the art appreciate, additional
mathematical operations are possible and contemplated.

According to the preferred embodiment of the integrated
general purpose media processor 12, a register file 110 is
provided in addition to the execution unit 100 to process
media data. The register file 110 stores and transmits data
streams to and from the execution unit 100 via the data path
108. Rather than employing a complex set of specific or
dedicated registers, the general purpose media processor 12
preferably includes 64 general purpose registers in the
register file 110 along with one program counter (not
shown). The 64 general purpose registers contained in the
register file 110 are all available to the user/programmer, and
comprise a portion of the user state of the general purpose
media processor 12. The general purpose registers are preferably
capable of storing any form of data. Each register
within the register file 110 is coupled to the data path 108
and is accessible to the execution unit 100 in the same
manner. Thus, the user can employ a general purpose
register according to the specific needs of a particular
program or unique application. As those skilled in the art
will appreciate, the register file 110 can also comprise a
plurality of register files 110 configured in parallel in order
to support parallel multi-threaded processing.

Instruction Set and User Programming

Control or manipulation of data processed by the general
purpose media processor 12 is achieved by selected instructions
programmed by the user. Those skilled in the art will
appreciate that a great number of programs are possible
through various sequences of instructions. Particular programs
can be developed for each unique implementation of
the general purpose media processor 12. A detailed discussion
of such specific programs is therefore beyond the scope
of this description.

One presently preferred instruction set for the general
purpose media processor 12 is included in the Microfiche
Appendix, the contents of which are hereby incorporated
herein by reference. A list of the presently preferred major
operation codes for the general purpose media processor 12
appears below in Table I.

TABLE I
MAJOR OPERATION CODES

MAJOR  0               32              64           96            128      160        192       224
0      ERES            GSHUFFLEI       FMULADD16    GMULADD1      LU16LAI  SAAS64LAI  EADDIO    BFE16
1      ESHUFFLEI4MUX   GSHUFFLEI4MUX   FMULADD32    GMULADD2      LU16BAI  SAAS64BAI  EADDIUO   BFNUE16
2                      GSELECT8        FMULADD64    GMULADD4      LU16LI   SCAS64LAI  ESETIL    BFNUGE16
3      EMDEPI          GMDEPI                       GMULADD8      LU16BI   SCAS64BAI  ESETIGE   BFNUL16
4      EMUX            GMUX            FMULSUB16    GMULADD16     LU32LAI  SMAS64LAI  ESETIE    BFE32
5      E8MUX           G8MUX           FMULSUB32    GMULADD32     LU32BAI  SMAS64BAI  ESETINE   BFNUE32
6                      GGFMUL8         FMULSUB64    GMULADD64     LU32LI   SMUX64LAI  ESETIUL   BFNUGE32
7      ETRANSPOSE8MUX  GTRANSPOSE8MUX               GEXTRACT128   LU32BI   SMUX64BAI  ESETIUGE  BFNUL32
8                                                                 L16LAI   S16LAI     ESUBIO    BFE64
9      ESWIZZLE        GSWIZZLE                     GUMULADD2     L16BAI   S16BAI     ESUBIUO   BFNUE64
10                     GSWIZZLECOPY                 GUMULADD4     L16LI    S16LI      ESUBIL    BFNUGE64
11                     GSWIZZLESWAP                 GUMULADD8     L16BI    S16BI      ESUBIGE   BFNUL64
12     EDEPI           GDEPI           F.16         GUMULADD16    L32LAI   S32LAI     ESUBIE    BFE128
13     EUDEPI          GUDEPI          F.32         GUMULADD32    L32BAI   S32BAI     ESUBINE   BFNUE128
14     EWTHI           GWTHI           F.64         GUMULADD64    L32LI    S32LI      ESUBIUL   BFNUGE128
15     EUWTHI          GUWTHI                       GUEXTRACT128  L32BI    S32BI      ESUBIUGE  BFNUL128
16                                     GFMULADD16   GEXTRACTI     L64LAI   S64LAI     EADDI     BANDE
17                                     GFMULADD32   GEXTRACTI16   L64BAI   S64BAI     EXORI     BANDNE
18                                     GFMULADD64   GEXTRACTI32   L64LI    S64LI      EORI      BL/BLZ
19                                     GFMULADD128  GUEXTRACTI64  L64BI    S64BI      EANDI     BGE/BGEZ
20                                     GFMULSUB16   GEXTRACT      L128LAI  S128LAI    ESUBI     BE
21                                     GFMULSUB32   J.64          L128BAI  S128BAI              BNE
22                                     GFMULSUB64   GUEXTRACT     L128LI   S128LI     ENORI     BUL/BGZ
23                                     GFMULSUB128  J.128         L128BI   S128BI     ENANDI    BUGE/BLEZ
24                                                  G.1           LBI      SBI                  BGATEI
25                                                  G.2           LUBI
26                                                  G.4
27                                                  G.8
28                     ECOPYI          GF.16        G.16                              ECOPYI    BI
29                                     GF.32        G.32                                        BLINKI
30                                     GF.64        G.64
31                     E.MINOR         GF.128       G.128         L.MINOR  S.MINOR    E.MINOR   B.MINOR

major operation code field values

As shown in Table I, the major operation codes are grouped
according to the function performed by the operations. The
operations are thus arranged and listed above according to
the presently preferred operation code number for each
instruction. As many as 255 separate operations are contemplated
for the preferred embodiment of the general
purpose media processor 12. As shown in Table I, however,
not all of the operation codes are presently implemented. As
those skilled in the art will appreciate, alternate schemes for
organizing the operation codes, as well as additional operation
codes for the general purpose media processor 12, are
possible.
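
The relationship between a cell of Table I and its operation code number can be illustrated with a short sketch. It assumes, since the text only says the operations are listed according to operation code number, that the number of an operation equals its column heading (0, 32, ..., 224) plus its row number (0 to 31). The function names are hypothetical and are not taken from the Microfiche Appendix.

```python
from typing import Tuple

# Column headings and row count as they appear in Table I.
COLUMN_BASES = (0, 32, 64, 96, 128, 160, 192, 224)
ROWS_PER_COLUMN = 32


def major_opcode(column_base: int, row: int) -> int:
    """Opcode number of a Table I cell (assumption: column heading + row)."""
    if column_base not in COLUMN_BASES:
        raise ValueError(f"unknown column heading: {column_base}")
    if not 0 <= row < ROWS_PER_COLUMN:
        raise ValueError(f"row out of range: {row}")
    return column_base + row


def table_position(opcode: int) -> Tuple[int, int]:
    """Inverse mapping: opcode number -> (column heading, row)."""
    if not 0 <= opcode < len(COLUMN_BASES) * ROWS_PER_COLUMN:
        raise ValueError(f"opcode out of range: {opcode}")
    row = opcode % ROWS_PER_COLUMN
    return opcode - row, row


if __name__ == "__main__":
    print(major_opcode(64, 0))   # FMULADD16 is listed in column 64, row 0
    print(table_position(255))   # B.MINOR is listed in column 224, row 31
```

Under this reading, each column of Table I covers a block of 32 consecutive operation code numbers, which is consistent with the grouping by function described above.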

The instructions provided in the instruction set for the
general purpose media processor 12 control the transfer,
processing and manipulation of data streams between the
register file 110 and the execution unit 100. The presently
preferred width of the instruction path 112 is 32 bits,
organized as four eight-bit bytes ("quadlets"). Those skilled
in the art will appreciate, however, that the instruction path
112 can take on any width without departing from the spirit
and scope of the invention. Preferably, each instruction
within the instruction set is stored or organized in memory
on four-byte boundaries. The presently preferred format for
instructions is shown in FIG. 9(a).

As shown in FIG. 9(a), each of the presently preferred
instruction formats for the general purpose media processor
12 includes a field 280 for the major operation code number
shown in Table I. Based on the type of operation performed,
the remaining bits can provide additional operands according
to the type of addressing employed with the operation.
For example, the remainder of the 32-bit instruction field can
comprise an immediate operand ("imm"), or operands stored
in any of the general registers ("ra," "rb," "rc," and "rd"). In
addition, minor operation codes 282 can also be included
among the operands of certain 32-bit instruction formats.
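
As an illustration of how a 32-bit instruction word might be split into fields, the following sketch decodes one hypothetical four-register format. The bit positions are assumptions made for the example: an 8-bit major operation code in the most significant byte, followed by four 6-bit register numbers (6 bits being enough to select one of 64 general registers). The actual layouts are those of FIG. 9(a), which is not reproduced here, and other formats carry an immediate operand or a minor operation code instead.

```python
from typing import NamedTuple

INSTRUCTION_BYTES = 4  # instructions are stored on four-byte boundaries


class Fields(NamedTuple):
    major: int  # major operation code (assumed: bits 31..24)
    ra: int     # register fields (assumed: 6 bits each, bits 23..0)
    rb: int
    rc: int
    rd: int


def encode(major: int, ra: int, rb: int, rc: int, rd: int) -> int:
    """Pack the assumed four-register format into a 32-bit word."""
    if not 0 <= major <= 0xFF:
        raise ValueError("major operation code must fit in 8 bits")
    for reg in (ra, rb, rc, rd):
        if not 0 <= reg <= 0x3F:
            raise ValueError("register number must be in 0..63")
    return (major << 24) | (ra << 18) | (rb << 12) | (rc << 6) | rd


def decode(word: int) -> Fields:
    """Unpack a 32-bit word according to the same assumed layout."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("instruction word must fit in 32 bits")
    return Fields(
        major=(word >> 24) & 0xFF,
        ra=(word >> 18) & 0x3F,
        rb=(word >> 12) & 0x3F,
        rc=(word >> 6) & 0x3F,
        rd=word & 0x3F,
    )


def is_instruction_aligned(address: int) -> bool:
    """True when a byte address lies on a four-byte boundary."""
    return address % INSTRUCTION_BYTES == 0


if __name__ == "__main__":
    word = encode(major=64, ra=1, rb=2, rc=3, rd=63)
    print(hex(word), decode(word), is_instruction_aligned(0x1000))
```

The field widths add up exactly (8 + 4 x 6 = 32), which is why this layout was chosen for the sketch.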

The presently preferred embodiment of the general pur- 
pose media processor 12 includes a limited instruction set 
similar to those seen in Reduced Instruction Set Computer 
("RISC") systems. The preferred instruction set for the 
general purpose media processor 12 shown in Table I 
includes operations which implement load, store, 
synchronize, branch and gateway functions. These five 
groups of operations can be visually represented as two 
general classes of related operations. The branch and gate- 
way operations perform related functions on media data 
streams and are thus visually represented as block 114 in 
FIG. 7. Similarly, the load, store and synchronize operations 
are grouped together in block 116 and perform similar 
operations on the media data streams. (Blocks 114 and 116 
only represent the above classification of these operations 
and their function in the processing of media data streams, 
and do not indicate any specific underlying electronic 
connections.) A more detailed discussion of these 
operations, and the functionality of the general purpose
media processor 12, appears in the Microfiche Appendix.

The four-byte structure of instructions for the general 
purpose media processor 12 is preferably independent of the 
byte ordering used for any data structures. Nevertheless, the 
gateway instructions are specifically defined as 16-byte 
structures containing a code address used to securely invoke 
a procedure at a higher privilege level. Gateways are preferably
marked by protection information specified in the
translation lookaside buffer 148 in the memory management
unit 122. Gateways are thus preferably aligned on 16-byte
boundaries in the external memory. In addition to the general 
purpose registers and program counter, a privilege level 
register is provided within the register file 110 that contains 
the privilege level of the currently executing instruction. 

The instruction set preferably includes load and store 
instructions that move data between memory and the register 
file 110, branch instructions to compare the content of
registers and transfer control, and arithmetic operations to
perform computations on the contents of registers. Swap
instructions provide multi-thread and multi-processor synchronization.
These operations are preferably indivisible and
include such instructions as add-and-swap, compare-and-swap,
and multiplex-and-swap instructions. The fixed-point
compare-and-branch instructions within the instruction set
shown in Table I provide the necessary arithmetic tests for 
equality and inequality of signed and unsigned fixed-point 
values. The branch through gateway instruction provides a 
secure means to access code at a higher privileged level in 
a form similar to a high level language procedure call 
generally known in the art. 
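
The indivisible swap operations can be modeled in software as follows. This is a behavioral sketch only: a lock stands in for the hardware guarantee of indivisibility, the class and method names are hypothetical, and the semantics given to multiplex-and-swap (replace only the bits selected by a mask) are an assumption, since the precise definitions appear in the Microfiche Appendix rather than in this description.

```python
import threading


class Word64:
    """One 64-bit memory word; a lock models the indivisible update."""

    MASK = (1 << 64) - 1

    def __init__(self, value: int = 0) -> None:
        self._value = value & self.MASK
        self._lock = threading.Lock()

    def load(self) -> int:
        with self._lock:
            return self._value

    def add_and_swap(self, addend: int) -> int:
        """Add to the word and return the value it held before."""
        with self._lock:
            old = self._value
            self._value = (old + addend) & self.MASK
            return old

    def compare_and_swap(self, expected: int, new: int) -> int:
        """Store new only if the word equals expected; return the old value."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new & self.MASK
            return old

    def multiplex_and_swap(self, mask: int, new: int) -> int:
        """Assumed semantics: replace only the bits selected by mask."""
        with self._lock:
            old = self._value
            self._value = ((old & ~mask) | (new & mask)) & self.MASK
            return old


if __name__ == "__main__":
    counter = Word64(0)
    workers = [
        threading.Thread(
            target=lambda: [counter.add_and_swap(1) for _ in range(1000)]
        )
        for _ in range(4)
    ]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    print(counter.load())  # every increment is applied exactly once
```

Because each operation returns the prior contents of the word, a thread can tell whether its update took effect, which is the property that makes these instructions usable for multi-thread and multi-processor synchronization.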

The general purpose media processor 12 also preferably 
supports floating-point compare-and-branch instructions. 
The arithmetic operations, which are supported in hardware, 
include floating-point addition, subtraction, multiplication, 
division and square root. The general purpose media pro- 
cessor 12 preferably supports other floating-point operations 
defined by the ANSI-IEEE floating-point standard through 
the use of software libraries. A floating point value can 
preferably be 16, 32, 64 or 128 bits wide. Examples of the
presently preferred floating-point data sizes are illustrated
in FIG. 9(b).

The general purpose media processor 12 preferably supports
virtual memory addressing and virtual machine operation
through a memory management unit 122. Referring to
FIG. 10(a), one presently preferred embodiment of the
memory management unit 122 is shown. The memory
management unit 122 preferably translates global virtual
addresses into physical addresses by software programmable
routines augmented by a hardware translation lookaside
buffer ("TLB") 148. A facility for local virtual address
translation 164 is also preferably provided. As those skilled
in the art will appreciate, the memory management unit 122
includes a data cache 166 and a tag cache 168 that store data
and tags associated with memory sections for each entry in
the TLB 148.

A block diagram of one preferred embodiment of the TLB
148 is shown in FIG. 10(b). The TLB 148 receives a virtual
address 230 as its input. For each entry in the TLB 148, the
virtual address 230 is logically AND-ed with a mask 232.
The output of each respective AND gate 234 is compared via
a comparator 236 with each entry in the TLB 148. If a match
is detected, an output from the comparator 236 is used to
gate data 240 through a transceiver 238. As those skilled in
the art will appreciate, a match indicates the entry of the
corresponding physical address within the contents of the
TLB 148 and no external memory or I/O access is required.
The data 240 for the data cache 166 (FIG. 10(a)) is then
combined with the remaining lower bits of the virtual
address 230 through an exclusive-OR gate 242. The resultant
combination is the physical address 244 output from the
TLB 148. If a match is not detected between the logical
address and the contents of the tag cache 168, an external
memory or I/O access by the memory management unit 122
is necessary to retrieve the relevant portion of memory and
update the contents of the TLB 148 accordingly.
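
The masked lookup just described can be sketched in software. The sketch below is a minimal model under stated assumptions: addresses are 64 bits wide, each entry holds a mask, a tag to compare against, and the data gated out on a match, and the "remaining lower bits" are taken to be the address bits not selected by the mask. The names `TlbEntry`, `translate` and `region_mask` are hypothetical, and a miss is reported as `None` in place of the external memory access performed by the hardware.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

ADDRESS_BITS = 64
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1


@dataclass(frozen=True)
class TlbEntry:
    mask: int  # address bits that take part in the comparison (mask 232)
    tag: int   # value the masked virtual address must equal
    data: int  # value gated out on a match (data 240)


def region_mask(region_bytes: int) -> int:
    """Mask selecting the address bits above a power-of-two sized region."""
    return ADDRESS_MASK & ~(region_bytes - 1)


def translate(entries: Iterable[TlbEntry], virtual_address: int) -> Optional[int]:
    """Return the physical address on a match, or None on a miss."""
    for entry in entries:
        if (virtual_address & entry.mask) == entry.tag:
            # Bits not covered by the mask pass through and are XOR-ed
            # with the entry data, as with exclusive-OR gate 242.
            low_bits = virtual_address & ~entry.mask & ADDRESS_MASK
            return entry.data ^ low_bits
    return None


if __name__ == "__main__":
    tlb = [
        TlbEntry(mask=region_mask(4096), tag=0x0040_0000, data=0x1234_5000),
        TlbEntry(mask=region_mask(1 << 20), tag=0x0070_0000, data=0x9000_0000),
    ]
    print(hex(translate(tlb, 0x0040_0ABC)))  # small region, first entry
    print(hex(translate(tlb, 0x0071_2345)))  # larger region, second entry
    print(translate(tlb, 0x0050_0000))       # no entry matches
```

Because the mask is stored per entry, the second entry in the usage example covers a one-megabyte region with a single entry, while the first covers only four kilobytes.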

Using generally known memory management techniques,
the memory management unit 122 ensures that instructions
(and data) are properly retrieved from external memory (or
other sources) over an external input/output bus 126 (see
FIG. 7). As described in more detail below, a high bandwidth
interface 124 is coupled to the external input/output bus 126
to communicate instructions (and media data streams) to the
general purpose media processor 12. The presently preferred
physical address width for the general purpose media processor
12 is eight bytes (64 bits). In addition, the memory
management unit 122 preferably provides match bits (not
shown) that allow large memory regions to be assigned a
single TLB entry, allowing for fine grain memory management
of large memory sections. The memory management
unit 122 also preferably includes a priority bit (not shown)
that allows for preferential queuing of memory areas according
to respective levels of priority. Other memory management
operations generally known in the art are also performed
by the memory management unit 122.

Referring again to FIG. 7, instructions received by the
general purpose media processor 12 are stored in a combined
instruction buffer/cache 118. The instruction buffer/
cache 118 is dynamically subdivided to store the largest
sequence of instructions capable of execution by the execution
unit 100 without the necessity of accessing external
memory. In a preferred embodiment of the invention,
instruction buffer space is allocated to the smallest and most
frequently executed blocks of media instructions. The
instruction buffer thus helps maintain the high bandwidth
capacity of the general purpose media processor 12 by
sustaining the number of instructions executed per second at
or near peak operation. That portion of the instruction
buffer/cache 118 not used as a buffer is, therefore, available
to be used as cache memory. The instruction buffer/cache
118 is coupled to the instruction path 112 and is preferably
32 kilobytes in size.

A data buffer/cache 120 is also provided to store data 
transmitted and received to and from the execution unit 100 
and register file 110. The data buffer/cache 120 is also 
dynamically subdivided in a manner similar to that of the 
instruction buffer/cache 118. The buffer portion of the data 
buffer/cache 120 is optimized to store a set size of unified 
media data capable of execution without the necessity of 
accessing external memory. In a preferred embodiment of 
the invention, data buffer space is allocated to the smallest 
and most frequently accessed working sets of media data. 
Like the instruction buffer, the data buffer thus maintains 
peak bandwidth of the general purpose media processor 12. 
The data buffer/cache 120 is coupled to the data path 108 
and is preferably also 32 kilobytes in size. 

The preferred embodiment of the general purpose media
processor 12 includes a pipelined instruction prefetch
structure. Although pipelined operation is supported, the general
purpose media processor 12 also allows for non-pipelined
operations to execute without any operational penalty. One
preferred pipeline structure for the general purpose media
processor 12 comprises a "super-string" pipeline shown in
FIG. 11. A super-string pipeline is designed to fetch and
execute several instructions in each clock cycle. The
instructions available for the general purpose media processor 12
can be broken down into five basic steps of operation. These
steps include a register-to-register address calculation, a
memory load, a register-to-register data calculation, a
memory store and a branch operation. According to the
super-string pipeline organization of the general purpose
media processor 12, one instruction from each of these five
types may be issued in each clock cycle. The presently
preferred ordering of these operations is as listed above,
where the five steps are assigned the letters "A," "L,"
"E," "S" and "B" (see FIG. 11).

According to the super-string pipelining technique, the
instructions are serially dependent, as shown in FIG.
11, and the general purpose media processor 12 has the
ability to issue a string of dependent instructions in a single
clock cycle. The instructions shown in FIG. 11 can take
from two to five cycles of latency to execute, and a branch
prediction mechanism is preferably used to keep the
pipeline filled (described below). Instructions can be
encoded in unit categories such as address, load, store/sync,
fixed, float and branch to allow for easy decoding. A similar
scheme is employed to prefetch data for the general purpose
media processor 12.

As those skilled in the art will appreciate, the super-string
pipeline can be implemented in a multi-threaded environment.
In such an implementation, the number of threads is
preferably relatively prime with respect to functional unit
rates so that functional units can be scheduled in a
non-interfering fashion between each thread.
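
The relatively-prime condition can be illustrated with a small sketch. The scheduling model below is an assumption made for illustration and is not taken from the specification: thread t is taken to own every cycle c with c mod T = t, and a functional unit is taken to accept one operation every R cycles. The function name `slots_per_thread` is hypothetical.

```python
from math import gcd


def slots_per_thread(num_threads: int, unit_period: int) -> list[int]:
    """Count the unit slots each thread receives over one schedule period.

    Assumed model (illustrative only): thread t owns cycle c when
    c % num_threads == t, and the functional unit accepts an operation
    only on cycles where c % unit_period == 0.
    """
    # One full schedule period is the least common multiple of the two rates.
    period = num_threads * unit_period // gcd(num_threads, unit_period)
    counts = [0] * num_threads
    for cycle in range(period):
        if cycle % unit_period == 0:
            counts[cycle % num_threads] += 1
    return counts


# Relatively prime rates: every thread gets a slot.
print(slots_per_thread(5, 2))
# Rates sharing a factor: some threads never reach the unit.
print(slots_per_thread(4, 2))
```

Under this assumed model, five threads and a unit rate of two share the unit evenly, while four threads and a unit rate of two leave two threads without any slot.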

In another more preferred embodiment, a "super-spring"
pipelining scheme is employed with the general purpose
media processor 12. The super-spring pipeline technique
breaks the super-string pipeline shown in FIG. 11 into two
sections that are coupled via a memory buffer (not shown).
A visual representation of the super-spring pipeline technique
is shown in FIG. 12. The front of the pipeline 204, in
which address calculation (A), memory load (L), and branch
(B) operations are handled, is decoupled from the back of
the pipeline 206, in which data calculation (E) and memory
store (S) operations are handled. The decoupling is
accomplished through the memory buffer (not shown), which is
preferably organized in a first-in-first-out ("FIFO") fast/
dense structure. (The memory buffer is functionally
represented as a spring in FIG. 12.)

As indicated in Table I above, the general purpose media
processor 12 does not include delayed branch instructions,
and so relies upon branch or fetch prediction techniques to
keep the pipeline full in program flows around unconditional
and conditional branch instructions. Many such techniques
are generally known in the art. Examples of some presently
preferred techniques include the use of group compare and
set, and multiplex operations to eliminate unpredictable
branches; the use of short forward branches, which cause
pipeline neutralization; and the use of branch and link, which
predicts the return address in a one or more entry stack. In addition,
the specialized gateway instructions included in the general
purpose media processor 12 allow for branches to and from
protected virtual memory space. The gateway instructions,
therefore, allow an efficient means to transfer between
various levels of privilege.

As described above, two basic forms of media data are
processed by the general purpose media processor 12, as
shown in FIG. 7. These data streams generally comprise
Nyquist sampled I/O 128, and standard memory and I/O
130. As shown in FIG. 7, audio 132, video 134, radio 136,
network 138, tape 140 and disc 142 data streams comprise
some examples of digitally sampled I/O 128. As those
skilled in the art will appreciate, other forms of digitally
sampled I/O are contemplated for processing by the general
purpose media processor 12 without departing from the
spirit and scope of the invention. Standard memory and I/O
130 comprises data received and transmitted to and from
general digital peripheral devices used in the design of most
computer systems. As shown in FIG. 7, some examples of
such devices include dynamic random access memory
("DRAM") 146, or any data received over the PCI bus 144
generally known in the art. Other forms of standard memory
and I/O sources are also contemplated. The various
fixed-point data sizes preferred for the general purpose media
processor 12 are illustrated in FIG. 9(c).


External Interface 

As mentioned above, the general purpose media processor
12 includes a high bandwidth interface 124 to communicate
with external memory and input/output sources. As part of
the high bandwidth interface 124, the general purpose media
processor 12 integrates several fast communication channels
156 (FIG. 13) to communicate externally. These fast
communication channels 156 preferably couple to external
caches 150, which serve as a buffer to memory interfaces
152 coupled to standard memory 154. The caches 150
preferably comprise synchronous static random access
memory ("SRAM"), each of which is sixty-four kilobytes
in size; and the standard memories 154 comprise DRAMs.
The memory interfaces 152 transmit data between the
caches 150 and the standard memories 154. The standard
memories 154 together form the main external memory for
the general purpose media processor 12. The cache 150,
memory interface 152, standard memory 154 and input/
output channel 156 therefore make up a single external
memory unit 158 for the general purpose media processor
12.

According to the presently preferred embodiment of the
invention, the memory interface protocol embeds read and
write operations to a single memory space into packets
containing command, address, data and acknowledgment
information. The packets preferably include check codes
that will detect single-bit transmission errors and some
multiple-bit errors. As many as eight operations may be in
progress at a time in each external memory unit 158. As
shown in FIG. 13, up to four external memory units 158 may
be cascaded together to expand the memory available to the
general purpose media processor 12, and to improve the
bandwidth of the external memory. Through such cascaded
memory units 158, the memory interface 152 provides for
the direct connection of multiple banks of standard memory
154 to maintain operation of the general purpose media
processor 12 at sustained peak bandwidths.

According to one embodiment shown in FIG. 13, up to
four standard memory devices 154 can be coupled to each
memory interface 152. Each standard memory 154 thus
includes as many as four banks of DRAM, each of which is
preferably sixteen bits wide. The standard memories 154 are
connected in parallel to the memory interface 152, forming
a 72-bit wide data bus 160, where 64 bits are preferably
provided for data transfer and eight bits are provided for
error correction. In addition to the data bus 160, an address/
control bus 162 is coupled between the memory interface
152 and each standard memory 154. The address/control bus
162 preferably comprises at least twelve address lines (4
kilobits x16 memory size) and four control lines as shown in
FIG. 13. An alternate manner for coupling the DRAMs to
the memory interface 152 is illustrated in FIG. 14. As shown
in FIG. 14, two banks of four DRAM single in-line memory
modules are coupled in parallel to the memory interface 152.
The memory interface 152 also supports interleaving to
enhance bandwidth, and page mode accesses to improve
latency for localized addressing.

Using standard DRAM components, the external memory
units 158 achieve bandwidths of approximately two
gigabits/second with the standard memories 154. When four
such external memory units 158 are coupled via the
communication channel 156, therefore, the total bandwidth of
the external main memory system increases to one gigabyte/
second. As discussed further below, in implementations with
two or eight communication channels 156, the aggregate
bandwidth increases to two and eight gigabytes/second,
respectively.
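
The bandwidth figures above follow from simple unit conversion, which the short sketch below makes explicit. The constant and function names are illustrative; only the numeric values (two gigabits/second per unit, four units per channel) come from the text.

```python
BITS_PER_BYTE = 8
UNIT_GBIT_PER_S = 2    # approximate bandwidth of one external memory unit 158
UNITS_PER_CHANNEL = 4  # maximum number of cascaded units on one channel 156


def channel_gbyte_per_s(units: int = UNITS_PER_CHANNEL) -> float:
    """Bandwidth of one communication channel in gigabytes/second."""
    return units * UNIT_GBIT_PER_S / BITS_PER_BYTE


def aggregate_gbyte_per_s(channels: int) -> float:
    """Aggregate bandwidth when several channels operate in parallel."""
    return channels * channel_gbyte_per_s()


print(channel_gbyte_per_s())     # four units on one channel
print(aggregate_gbyte_per_s(2))  # two channels
print(aggregate_gbyte_per_s(8))  # eight channels
```

Four units at two gigabits/second give eight gigabits/second, that is, one gigabyte/second per channel, which scales linearly to two and eight gigabytes/second for two and eight channels.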

A more detailed depiction of the communication channel 
156 circuitry appears in FIG. 15. According to the preferred 
embodiment of the invention, each communication channel 
156 comprises two unidirectional, byte-wide, differential, 
packet-oriented data channels 156a, 156b (see FIG. 13). As
explained above, where memory units 158 are cascaded 
together in series, the output of one memory unit 158 is 
connected to the input of another memory unit 158. The two 
unidirectional channels are thus connected through the 
memory units 158 forming a loop structure and make up a 
single bi-directional memory interface channel. 

Referring to FIG. 15, each communication channel 156 is
preferably eight bits wide, and each bit is transmitted
differentially. For example, output transceiver 170 for bit
D0 transmits both D0 and /D0 signals over the
communication channel 156. Additional transceivers are similarly
provided for the remaining bits in the channel 156. (The
transceiver 176 for bit D7 and associated differential lines
178, 180 are shown in FIG. 15.) A CLK transceiver 182
is also provided to generate differential clock outputs 184,
186 over the channel 156. To complete the link between
memory units 158, input transceivers 188-192 are provided
in each memory unit 158 for each of the differential bits and
clock signals transmitted over the communication channel
156. These input signals 172, 174, 178, 180, 184, 186 are
preferably transmitted through input buffers 194-198 to
other parts of the memory unit 158 (described above).

Each memory unit 158 also includes a skew calibrator 200
and phase locked loop ("PLL") 202. The skew calibrator 200
is used to control skew in signals output to the
communication channel 156. Preferably, digital skew fields are
employed, which include set numbers of delay stages to be
inserted in the output path of the communication channel
156. Setting these fields, and the corresponding analog skew
fields, permits a fine level of control over the relative skew
between output channel signals.

The PLL 202 recovers the clock signal on either side of
the communication channel 156 and is thus provided to
remove clock jitter. The clock signals 184, 186 preferably
comprise a single phase, constant rate clock signal. The
clock signals 184, 186 thus contain alternating zero and one
values transmitted with the same timing as the data signals
172, 174, 178, 180. The clock signal frequency is, therefore,
one-half the byte data rate. The communication channel 156
preferably operates at constant frequency and contains no
auxiliary control, handshaking or flow control information.

Each external memory unit 158 preferably defines two
functional regions: a memory region, implemented by the
cache 150 backed by standard memory 154 (see FIG. 13),
and a configuration region, implemented by registers (not
shown). The two regions are accessed by separate interfaces:
the communication channel 156 is used to access the
memory region, and a serial interface (described below) is
used to access the configuration region. In the memory
region, the caches 150 are preferably write-back (write-in)
single-set (direct-map) caches for data originally contained
in standard memory 154. All accesses to memory space
should maintain consistency between the contents of the
cache 150 and the contents of the standard memory 154. The
configuration region registers provide the mechanism to
detect and adjust skew in the communication channel 156.
Software is preferably employed to adaptively adjust the
skew in the channel 156 through digital skew fields, as
explained above. The serial interface thus is used to
configure the external memory units 158, set diagnostic modes
and read diagnostic information, and to enable the use of a
high-speed tester (not shown).

One presently preferred embodiment of the invention
employs two byte-wide packet communication channels 156
(FIG. 16(a)). In order to further increase the bandwidth of
the general purpose media processor 12, up to sixteen
byte-wide packet communication channels 156 can be
employed. Referring to FIG. 16(b), twelve communication
channels, comprising eight memory channels 210, a ninth
channel for parallel processing 212 (described below), and
three input/output ("I/O") channels 214, are shown. Each of
the communication channels 210-214 preferably employs
the cascade configuration of four channel interface devices
216. (Each channel interface device 216 coupled to the
memory channels 210 corresponds to the external memory
unit 158 shown in FIG. 13.) Through each of the twelve
communication channels shown in FIG. 16(b), the general
purpose media processor 12 can request or issue read or
write transactions. When not interleaved, the twelve
channels provide a single contiguous memory space for each
channel interface device 216.

Alternatively, memory accesses may be interleaved in
order to provide for continuous access to the external
memory system at the maximum bandwidth for the DRAM
memories. In an interleaved configuration, at any point in
time some memory devices will be engaged in row precharge,
while others may be driving or receiving data, or
receiving row or column addresses. The memory interface
152 (FIG. 13) thus preferably maps between a contiguous
address space and each of the separate address spaces made
[illegible in source] are handled
by different memory devices. Moreover, in the preferred
embodiment additional memory operations may be
requested before the corresponding DRAM bank is available.
[illegible in source]
in a queue until they can be processed. According to the
preferred embodiment, memory writes have lower priority
than memory reads, unless an attempt is made to read an
address that is queued for a write operation. As those skilled
in the art will appreciate, the depth of the memory write
queue is dictated by the specific implementation.

Although up to four external memory units 158 are
preferably cascaded to form effectively larger memories,
some amount of latency may be introduced by the cascade.
Packets of data transmitted over the communication channel
156 are uniquely addressed to a particular channel interface
device 216. A packet received at a particular device, which
specifies another module address, is automatically passed to
the correct channel interface device 216. Unless the module
address matches a particular device 216, that packet simply
passes from the input to the output of the interface device
216. This mechanism divides the serial interconnection of
interface devices 216 into strings, which function as a single
larger memory or peripheral, but with possibly longer
response latency.

In addition to the memory channels 210, the general
purpose media processor 12 provides several communication
channels 214 for communication with external input/
output devices. Referring to FIG. 16(b), three input/output
channels 214 having SRAM buffered memory (see FIG. 13)
provide an interface to external standard I/O devices (not
shown). Like the eight memory channels 210, the three I/O
channels 214 are byte-wide input/output channels intended
to operate at rates of at least one gigahertz. The three I/O
channels 214 also operate as a packet communication link to
synchronous SRAM memory 208 within the channel interface
device 216. A controller 226 within the channel interface
device 216 completes the interface to the I/O devices.
The three I/O channels 214 preferably function in like
manner to the memory channels 210 described above. The
interface protocol for the three I/O channels 214 divides read
and write operations to a single memory space into packets
containing command, address, data and acknowledgment
information. The packets also include a check code that will
detect single-bit transmission errors and some multiple-bit
errors. According to the preferred embodiment of the
invention, as many as eight operations may progress in each
interface device 216 at a time. As shown in FIG. 16(b), up
to four channel interface devices 216 can be cascaded
together to expand the bandwidth in the three I/O channels
214. A bit-serial interface (not shown) is also provided to
each of the channel interface devices 216 to allow access to
configuration, diagnostic and tester information at standard
TTL signal levels at a more moderate data rate. (A more
detailed description of the serial interface is provided
below.)

Like the memory channels 210, each I/O channel 214
includes nine signals: one clock signal and eight data
signals. Differential voltage levels are preferably employed
for each signal. Each channel interface device 216 is
preferably terminated in a nominal 50 ohm impedance to
ground. This impedance applies for both inputs and outputs
to the communication channel 156. A programmable
termination impedance is preferred.

Interface Communication

According to one presently preferred embodiment of the
invention, the channel interface devices 216 can operate as
either master devices or slave devices. A master device is
capable of generating a request on the communication
channel 156 and receiving responses from the communication
channel 156. Slave devices are capable of receiving
requests, and generating responses, over the communication
channel 156. A master device is preferably capable of
generating a constant frequency clock signal and accepting
signals at the same clock frequency over the communication
channel 156. A slave device, therefore, should operate at the
same clock rate as the communication channel 156, and
generate no more than a specified amount of variation in
output clock phase relative to input clock phase. The master
device, however, can accept an arbitrary input clock phase
and tolerates a specified amount of variation in clock phase
over operating conditions.

Packets of information sent over the communication
channel 156 preferably contain control commands, such as
read or write operations, along with addresses and associated
data. Other commands are provided to indicate error
conditions and responses to the above commands. When the
communication channel 156 is idle, such as during
initialization and between transmitted packets, an idle packet,
consisting of an all-zero byte and an all-one byte, is
transmitted through the communication channel 156. Each
non-idle packet consists of two bytes or a multiple of two bytes,
and begins with a byte having a value other than all zeros.
All packets transmitted over the communication channel 156
also begin during a clock period in which the clock signal is
zero, and all packets preferably end during a clock period in
which the clock signal is one. A depiction of the preferred
packet protocol format for transmission over the
communication channel 156 appears in FIG. 17.

The general form of each packet is an array of bytes
preferably without a specific byte ordering. The first byte
contains a module address 250 ("ma") in the high order two
bits; a packet identifier, usually a command 252 ("com"), in
the next three bit positions; and a link identification number
254 ("lid") in the last three bit positions. The interpretation
of the remaining bytes of a packet depends upon the contents
of the packet identifier. The length of each packet is
preferably implied by the command specified in the initial byte
of the packet. A check byte is provided and computed as odd
bit-wise parity with a leftward circular rotation after
accumulating each byte. This technique provides detection of all
single-bit and some multiple-bit errors, but no correction is
provided.

The module address 250 field of each packet is preferably
a two-bit field and allows for as many as four slave
devices to be operated from a single communication channel
156. Module address values can be assigned in one of two
fashions: either dynamically assigned through a configuration
register (not shown), or assigned via static/geometric
configuration pins. Dynamic assignment through a configuration
register is the presently preferred method for assigning
module address values.

The link identification number 254 field is preferably
3 bits wide and provides the opportunity for master devices
to initiate as many as eight independent operations at any
one time to each slave device. Each outstanding operation
requires a distinct link identification number, but no ordering
of operations should be implied by the value of the link
identification field. Thus, there is preferably no requirement
for link identification values 254 to be sequentially assigned
either in requests or responses.

The receipt of packets over the communication channel
156 that do not conform to the channel protocol preferably
generates an error condition. As those skilled in the art will
appreciate, the level or degrees to which a specific
implementation detects errors is defined by the user. In one
presently preferred embodiment of the invention, all errors
are detected and the following protocol is employed for
handling errors. For each error detected, the channel interface
device 216 causes a response explicitly indicating the
error condition. Channel interface devices 216 reporting an
invalid packet will then suppress the receipt of additional
packets until the error is cleared. The transmitted packet is
otherwise ignored. However, even though the erroneous
packet is ignored, the channel interface devices 216 preferably
continue to process valid packets that have already been
received and generate responses thereto. An identification of
the presently preferred commands 252 to be used over the
communication channel 156 is listed in FIG. 17.
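
The first-byte layout and the check byte described above for the packet protocol can be sketched as follows. This is a minimal illustration, not the patented implementation. The bit positions follow the text (module address in the high two bits, command in the next three, link identification number in the low three). The check byte routine is one plausible reading of "odd bit-wise parity with a leftward circular rotation after accumulating each byte"; the starting value, the one-bit rotation amount and the final inversion are assumptions, and all function names are hypothetical.

```python
def pack_header(ma: int, com: int, lid: int) -> int:
    """Pack module address (2 bits), command (3 bits), link id (3 bits)."""
    if not (0 <= ma < 4 and 0 <= com < 8 and 0 <= lid < 8):
        raise ValueError("field out of range")
    return (ma << 6) | (com << 3) | lid


def unpack_header(byte: int) -> tuple[int, int, int]:
    """Split a first byte back into (ma, com, lid)."""
    return (byte >> 6) & 0x3, (byte >> 3) & 0x7, byte & 0x7


def rotl8(value: int) -> int:
    """Rotate an 8-bit value left by one position (assumed rotation amount)."""
    return ((value << 1) | (value >> 7)) & 0xFF


def check_byte(payload: bytes) -> int:
    """Assumed check byte: XOR-accumulate each byte, rotate left after
    each accumulation, then invert so that the parity is odd."""
    acc = 0
    for b in payload:
        acc = rotl8(acc ^ b)
    return acc ^ 0xFF


header = pack_header(ma=2, com=5, lid=3)
print(hex(header), unpack_header(header))
print(hex(check_byte(bytes([header, 0x10, 0x20]))))
```

Because XOR and rotation each change exactly one accumulator bit when one input bit changes, flipping any single bit of the payload changes this check byte, consistent with the stated detection of all single-bit errors.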

In the master/slave preferred embodiment, the channel
interface devices 216 forward packets that are intended for
other devices connected to the communication channel 156,
as described above. In slave devices, forwarding is
performed based on the module address 250 field of the packet.
Packets which contain a module address 250 other than that
of the current device are forwarded on to the next device. All
non-idle packets are thus forwarded, including error packets.
In master devices, forwarding is performed based on the link
identifier number 254 of the packet. Packets that contain link
identifier numbers 254 not generated by the specific channel
interface device 216 are forwarded. In order to reduce
transmission latency, a packet buffer may be provided. As
those skilled in the art will appreciate, the suitable size for the
packet buffer depends on the amount of latency tolerable in
a particular implementation.
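
The two forwarding rules above reduce to a pair of simple predicates, sketched below. The `Packet` class and the function names are hypothetical and carry only the two fields the rules depend on; a real device would also handle idle packets, error packets and buffering.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    module_address: int  # two-bit module address field
    link_id: int         # three-bit link identification number


def slave_forwards(own_module_address: int, packet: Packet) -> bool:
    """A slave passes on any packet addressed to a different module."""
    return packet.module_address != own_module_address


def master_forwards(own_link_ids: set[int], packet: Packet) -> bool:
    """A master passes on any packet whose link id it did not generate."""
    return packet.link_id not in own_link_ids


pkt = Packet(module_address=2, link_id=5)
print(slave_forwards(own_module_address=1, packet=pkt))  # addressed elsewhere
print(slave_forwards(own_module_address=2, packet=pkt))  # addressed to this slave
print(master_forwards(own_link_ids={0, 5}, packet=pkt))  # this master's own link id
```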

A variety of master/slave ring configurations are possible
using the high bandwidth interface 124 of the invention.
Five ring configurations are currently preferred:
single-master, dual-master, multiple-master, single-slave and
multiple-master/multiple-slave. The simplest ring
configuration contains a single non-forwarding master device and a
single non-forwarding slave device. No forwarding is
required for either device in this configuration as packets are
sent directly to the recipient. A single-master ring, however,
may contain a cascade of up to four slave devices (see FIGS.
13, 16). In the single-master ring configuration, each slave
device is configured to a distinct module address, and each
slave device forwards packets that contain module address
fields unequal to its own. As discussed above, a
single-master ring provides a larger memory or I/O capacity than
a master-slave pair, but also introduces a potentially longer
response latency. In the single-master ring, each slave device
may have as many as eight transactions outstanding at any
time, as described above.

The remaining combinations share many of the above
basic attributes. In a dual-master pair, each master device
may initiate read and write operations addressed to the other,
and each may have up to eight such transactions outstanding.
No forwarding is required for either device because packets
are sent directly to the recipient. A multiple-master ring may
contain multiple master devices and a single slave device. In
this configuration, the slave device need not forward packets
as all input packets are designated for the single slave
device. A multiple-master ring may contain multiple master
devices and as many as four slave devices. Each slave device
may have up to eight transactions outstanding, and each
master device may use some of those transactions. In a
preferred embodiment, a master also has the capability to
detect a time-out condition or when a response to a request
packet is not received. Further aspects of inter-processor
communications and configurations are discussed below in
connection with FIG. 18.


Serial Bus 

In one preferred embodiment of the invention, the general
purpose media processor 12 includes a serial bus (not
shown). The serial bus is designed to provide bootstrap
resources, configuration, and diagnostic support to the
general purpose media processor 12. The serial bus preferably
employs two signals, both at TTL levels, for direct
communication among many devices. In the preferred embodiment,
the first signal is a continuously running clock, and the
second signal is an open-collector bi-directional data signal.
Four additional signals provide geographic addresses for
each device coupled to the serial bus. A gateway protocol,
and optional configurable addressing, each provide a means
to extend the serial bus to other buses and devices. Although
the serial bus is designed for implementation in a system
having a general purpose media processor 12, as those
skilled in the art will appreciate, the serial bus is applicable
to other systems as well.

Because the serial bus is preferably used for the initial
bootstrap program load of the general purpose media
processor 12, the bootstrap ROM is coupled to the serial bus. As
a result, the serial bus needs to be operational for the first
instruction fetch. The serial bus protocol is therefore devised
so that no transactions are required for initial bus
configuration or bus address assignment.

According to the preferred embodiment, the clock signal
comprises a continuously running clock signal at a minimum
of 20 megahertz. The amount of skew, if any, in the clock
signal between any two serial bus devices should be limited
to be less than the skew on the data signal. Preferably, the
serial data signal is a non-inverted open collector
bi-directional data signal. TTL levels are preferred for
communication on the serial bus, and several termination
networks may be employed for the serial data signal. A
simple preferred termination network employs a resistive
pull-up of 220 ohms to 3.3 volts above [illegible in source]. An alternate
embodiment employs a more complex termination network
such as a termination network including diodes or the
"Forced Perfect Termination" network proposed for the
SCSI-2 standard, which may be advantageous for larger
configurations.

The geographic addressing employed in the serial bus is
provided to ensure that each device is addressable with a
number that is unique among all devices on the bus and
which also preferably reflects the physical location of the
device. Thus, the address of each device remains the same
each time the system is operated. In one preferred
embodiment, the geographic address is composed of four
bits, thus allowing for up to 16 devices. In order to extend
the geographic addressing to more than 16 devices,
additional signals may be employed such as a buffered copy of
the clock signal or an inverted copy of the clock signal (or
both).

The serial bus preferably incorporates both a bit level and
packet protocol. The bit level protocol allows any device to
transmit one bit of information on the bus, which is received
by all devices on the bus at the same time. Each transmitted
bit begins at the rising edge of the clock signal and ends at
the next rising edge. The transmitted bit value is sampled at
the next rising edge of the clock signal. According to one
preferred embodiment where the serial data signal is an open
collector signal, the transmission of a zero bit value on the
bus is achieved by driving the serial data signal to a logical
low value. In this embodiment, the transmission of a one bit
value is achieved by releasing the serial data signal to obtain
a logical high value. If more than one device attempts to
transmit a value on the same clock, the resulting value is a
zero if any device transmits a zero value, and one if all
devices transmit a one value. This provides a "wired-AND"
collision mechanism, as those skilled in the art will
appreciate. If two or more devices transmit the same value on the
same clock cycle, however, no device can detect the
occurrence of a collision. In such cases, the transaction, which
may occur frequently in some implementations, preferably
proceeds as described below.

The packet protocol employed with the serial bus uses the
bit level protocol to transmit information in units of eight
bits or multiples of eight bits. Each packet transmission
preferably begins with a start bit in which the serial data
signal has a zero (driven) value. After transmitting the eight
data bits, a parity bit is transmitted. The transmission
continues with additional data. A single one (released) bit is
transmitted immediately following the least significant bit of
each byte signaling the end of the byte.

On the cycle following the transmission of the parity bit,
any device may demand a delay of two cycles to process the
data received. The two cycle delay is initiated by driving the
serial data signal (to a zero value) and releasing the serial
data signal on the next cycle. Before releasing the serial data
signal, however, it is preferable to ensure that the signal is
not being driven by any other device. Further delays are
available by repeating this pattern.

In order to avoid collisions, a device is not permitted to
start a transmission over the serial bus unless there are no
currently executing transactions. To resolve collisions that
may occur if two devices begin transmission on the same
cycle, each transmitting device should preferably monitor
the bus during the transmission of one (released) bits. If any
of the bits of the byte are received as zero when transmitting
a one, the device has lost arbitration and must cease
transmission of any additional bits of the current byte or
transaction.
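
The arbitration rule above can be sketched as a small simulation. This is only an illustration under stated assumptions: the function name, the dictionary of per-device bit sequences, and the requirement that all sequences have equal length are hypothetical and not taken from the specification.

```python
def arbitrate(frames):
    """Simulate bit-level arbitration among devices that start together.

    frames maps a device name to the list of bits it wants to send
    (0 = drive low, 1 = release). All lists are assumed equal in length.
    Returns the bits observed on the bus and the devices still transmitting.
    """
    active = set(frames)
    bus = []
    length = len(next(iter(frames.values())))
    for i in range(length):
        # Wired-AND of the bits sent by the devices still transmitting.
        level = 0 if any(frames[name][i] == 0 for name in active) else 1
        bus.append(level)
        # A device that released the line (sent 1) but reads 0 has lost
        # arbitration and stops transmitting.
        active = {name for name in active if frames[name][i] == level}
    return bus, sorted(active)


if __name__ == "__main__":
    bus, winners = arbitrate({"a": [0, 1, 0, 1], "b": [0, 1, 1, 1]})
    print(bus, winners)
```

If both devices send identical bits, both remain in the survivor list, which mirrors the undetected collision case described for the bit level protocol.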

According to the preferred embodiment of the invention,
a serial bus transaction consists of the transmission of a
series of packets. The transaction begins with a transmission
by the transaction initiator, which specifies the target
network, device, length, type and payload of the transaction
request. The transaction terminates with a packet having a
type field in a specified range. As a result, all devices
connected to the serial bus should monitor the serial data
signal to determine when transactions begin and end. A
serial bus network may have multiple simultaneous
transactions occurring, however, so long as the target and initiator
network addresses are all disjoint.
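
The disjointness condition for simultaneous transactions can be expressed as a short check. The sketch below assumes each transaction is represented by a hypothetical (initiator, target) pair of network addresses; that representation is not part of the specification.

```python
def can_run_concurrently(transactions):
    """Return True if no network address appears in more than one role.

    transactions is an iterable of (initiator_address, target_address)
    pairs. The transactions may overlap in time only if all initiator
    and target addresses are distinct from one another.
    """
    seen = set()
    for initiator, target in transactions:
        for address in (initiator, target):
            if address in seen:
                return False
            seen.add(address)
    return True


if __name__ == "__main__":
    print(can_run_concurrently([(1, 2), (3, 4)]))  # no shared address
    print(can_run_concurrently([(1, 2), (2, 3)]))  # address 2 is shared
```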

Parallel Processing 

In one preferred embodiment of the invention, two or
more general purpose media processors 12 can be linked
together to achieve a multiple processor system. According
to this embodiment, general purpose media processors 12
are linked together using their high bandwidth interface
channels 124, either directly or through external switching
components (not shown). The dual-master pair configuration
described above can thus be extended for use in
multiple-master ring configurations. Preferably, internal daemons
provide for the generation of memory references to remote
processors, accesses to local physical memory space, and the
transport of remote references to other remote processors. In
a multi-processor environment, all general purpose media
processors 12 run off of a common clock frequency, as
required by the communication channels 156 that connect
between processors.

Referring to FIG. 18, each general purpose media
processor 12 preferably includes at least a pair of
inter-processor links 218 (see also FIG. 16(b)). In one
configuration, both pairs of inter-processor links 218 can be
connected between the two processors 12 to further enhance
bandwidth. As shown in FIG. 18(a), several processors 12
may be interconnected in a linear network, employing the
transponder daemons in each processor. In an alternate
embodiment shown in FIG. 18(b), the inter-processor links
218 may be used to join the general purpose media
processors 12 in a ring configuration. Alternatively still, general
purpose media processors 12 may be interconnected into a
two-dimensional network of processors of arbitrary size, as
shown in FIG. 18(c). Sixteen processors are connected in
FIG. 18(e) by connecting four ring networks. In yet another
alternate embodiment, by connecting the inter-processor
links 218 to external switching devices (not shown),
multi-processors with a large number of processors can be
constructed with an arbitrary interconnection topology.

The requester, responder and transponder daemons
preferably handle all inter-processor operations. When one
general purpose media processor 12 attempts a load or store to
a physical address of a remote processor, the requester
daemon autonomously attempts to satisfy the remote
memory reference by communicating with the external
device. The external device may comprise another processor
12 or a switching device (not shown) that eventually reaches
another processor 12. Preferably, two requester daemons are
provided for each processor 12, which act concurrently on two
different byte channels and/or module addresses. The
responder daemon accepts writes from a specified channel
and module address, which enables an external device to
generate transaction requests in local memory or to generate
processor events. The responder daemon also generates link
level writes to the same external device that communicated
responses for the received transaction request. Two such
responder daemons are preferably provided, each of which
operates concurrently on two different byte channels and/or
module addresses.

The transponder daemon accepts writes from a specified
channel and module address, which enable an external
device to cause a requester daemon to generate a request on
another channel and module address. Preferably, two such
transponder daemons are provided, each of which acts
concurrently (back-to-back) between two different byte channels
and/or module addresses. As those skilled in the art will
appreciate, the requester, responder and transponder
daemons must act cooperatively to avoid deadlock that may
arise due to an imbalance of requests in the system.
Deadlocks prevent responses from being routed to their
destinations, which may defeat the benefits of a
multi-processor distributed system.

According to one presently preferred embodiment of the
invention, the general purpose media processor 12 can be
implemented as one or more integrated circuit chips.
Referring to FIG. 19, the presently preferred embodiment of the
general purpose media processor 12 consists of a four-chip
set. In the four-chip set, a general purpose media processor
12 is manufactured as a stand alone integrated circuit. The
stand alone integrated circuit includes a memory
management unit 122, instruction and data cache/buffers 118, 120,
and an execution unit 100. A plurality of signal input/output
pads 260 are provided around the circumference of the
integrated circuit to communicate signals to and from the
general purpose media processor 12 in a manner generally
known in the art.

The second and third chips of the four-chip set comprise
an external memory element 158 and a channel interface
device 216. The external memory element 158 includes an
interface to the communication channel 156, a cache 150
and a memory interface 152. The channel interface device
216 also includes an interface to the communication channel
156, as well as buffer memory 262, and input/output
interfaces 264. Both the external memory element 158 and the
channel interface device 216 include a plurality of
input/output signal pads 260 to communicate signals to and from
these devices in a generally known manner.

The fourth integrated circuit chip comprises a switch 226,
which allows for installation of the general purpose media
processor 12 in the heterogeneous network 38. In addition to
the plurality of input/output pads 260, the switch 226
includes an interface to the communication channel 156. The
switch 226 also preferably includes a buffer 262, a router
266, and a switch interface 268.

As those skilled in the art will appreciate, many imple- 
mentations for the general purpose media processor 12 are 
possible in addition to the four-chip implementation 
described above. Rather than an integrated approach, the 
general purpose media processor can be implemented in a 
discrete manner. Alternatively, the general purpose media 
processor 12 can be implemented in a single integrated 
circuit or in an implementation with fewer than four
integrated circuit chips. Other combinations and permutations of
these implementations are contemplated. 

There has been described a system for processing streams 
of media data at substantially peak rates to allow for real 
time communication over a large heterogeneous network. 
The system includes a media processor at its core that is 
capable of processing such media data streams. The
heterogeneous network consists of, for example, the fiber
optic/coaxial cable/twisted wire network in place throughout the
U.S. To provide for such communication of media data, a 
media processor according to the invention is disposed at 
various locations throughout the heterogeneous network. 
The media processor would thus function both in a server 
capacity and at an end user site within the network. 
Examples of such end user sites include televisions, set-top 
converter boxes, facsimile machines, wireless and cellular 
telephones, as well as large and small business and industrial 
applications. 

To achieve such high rates of data throughput, the media 
processor includes an execution unit, high bandwidth 
interface, memory management unit and pipelined instruc- 
tion and data paths. The high bandwidth interface includes 
a mechanism for transmitting media data streams to and 
from the media processor at rates at or above the gigahertz 
frequency range. The media data stream can consist of 
transmission, presentation and storage type data transmitted 
alone or in a unified manner. Examples of such data types 
include audio, video, radio, network and digital
communications. According to the invention, the media processor is
dynamically partitionable to process any combination or
permutation of these data types in any size.

A programmable, general purpose media processor
system presents significant advantages over current multimedia
communications. Rather than rigid, costly and inefficient
specialized processors, the media processor provides a
general purpose instruction set to ease programmability in a
single device that is capable of performing all of the
operations of the specialized processor combination. Providing a
uniform instruction set for all media related operations
eliminates the need for a programmer to learn several
different instruction sets, each for a different specialized
processor. The complexity of programming the specialized
processors to work together and communicate with one
another is also greatly reduced. The unified instruction set is
also more efficient. Highly specialized general calculation
instructions that are tailored to general or special types of
calculations rather than enhancing communication are
eliminated.

Moreover, the media processor system can be easily
reprogrammed simply by transmitting or downloading new
software over the network. In the specialized processor
approach, new programming usually requires the delivery
and installation of new hardware. Reprogramming the media
processor can be done electronically, which of course is
quicker and less costly than the replacement of hardware.

It is to be understood that a wide range of changes and
modifications to the embodiments described above will be
apparent to those skilled in the art and are contemplated. It
is therefore intended that the foregoing detailed description
be regarded as illustrative rather than limiting, and that it be
understood that it is the following claims, including all
equivalents, that are intended to define the spirit and scope
of this invention.
We claim: 

1. A method for processing unified media data streams,
comprising the steps of:

receiving a plurality of unified media data streams
transmitted over a data path, including presentation,
transmission and storage information;

dynamically partitioning the unified media data streams
based on an elemental symbol width, said elemental
symbol width being equal to or narrower than the data
path; and

processing the unified media data streams at substantially
peak operation; and

wherein the step of processing the unified media data
streams further comprises the steps of:

storing the unified media data streams in a general register
file;

performing multi-precision operations on the stored
unified media data streams based on programmed
instructions, the multi-precision arithmetic operations
including boolean, integer and floating point
mathematical operations;

manipulating component fields of the unified media data
streams based on programmed instructions that
implement copying, shifting and re-sizing operations; and

performing multi-precision mathematical operations on
the stored unified media data streams based on
programmed instructions, the mathematical operations
including finite group, finite ring and table look-up
operations.

2. The method defined in claim 1, further comprising the
steps of:

pre-fetching instructions and data to fill instruction and
data pipelines;

performing memory management operations to retrieve
instructions and data from external memory;

storing instructions and data in instruction and data
cache/buffers; and

dynamically allocating buffer storage in the instruction
and data cache/buffers to ensure real-time execution.

3. The method defined in claim 1, further comprising the
step of providing a set of instructions to process the unified
media data streams, the set of instructions including load,
store, synchronization, branch and gateway instructions.

4. The method defined in claim 3, further comprising the
step of programming a sequence of at least one instruction
from the set of instructions.

***** 
GENERAL PURPOSE, MULTIPLE
PRECISION PARALLEL OPERATION,
PROGRAMMABLE MEDIA PROCESSOR

This is a divisional of application Ser. No. 08/516,036,
filed Aug. 16, 1995, now U.S. Pat. No. 5,742,840.

A Microfiche Appendix consisting of 4 sheets (387 total
frames) of microfiche is included in this application. The
Microfiche Appendix contains material which is subject to 
copyright protection. The copyright owner has no objection 
to the facsimile reproduction by any one of the Microfiche 
Appendix, as it appears in the Patent and Trademark Office 
patent files or records, but otherwise reserves all copyright 
rights whatsoever. 

FIELD OF THE INVENTION 

This invention relates to the field of communications 
processing, and more particularly, to a method and apparatus 
for real-time processing of multi-media digital communica- 
tions. 

BACKGROUND OF THE INVENTION 

Optical fiber and discs have made the transmission and 
storage of digital information both cheaper and easier than 
older analog technologies. An improved system for digital 
processing of media data streams is necessary in order to 
realize the full potential of these advanced media. 

For the past century, telephone service delivered over 
copper twisted pair has been the lingua franca of commu- 
nications. Over the next century, broadband services deliv- 
ered over optical fiber and coax will more completely fulfill 
the human need for sensory information by supplying voice, 
video, and data at rates of about 1,000 times greater than
narrow band telephony. Current general-purpose
microprocessors and digital signal processors ("DSPs") can handle
digital voice, data, and images at narrow band rates, but they
are far too slow for processing media data at broadband
rates.

This shortfall in digital processing of broadband media is
currently being addressed through the design of many
different kinds of application-specific integrated circuits
("ASICs"). For example, a prototypical broadband device
such as a cable modem modulates and demodulates digital
data at rates up to 45 Mbits/sec within a single 6 MHz cable
channel (as compared to rates of 28.8 Kbits/sec within a 6
KHz channel for telephone modems) and transcodes it onto
a 10/100baseT connection to a personal computer ("PC") or
workstation. Current cable modems thus receive data from
a coaxial cable connection through a chain of specialized
ASIC devices in order to accomplish Quadrature Amplitude
Modulation ("QAM") demodulation, Reed-Solomon error
correction, packet filtering, Data Encryption Standard
("DES") decryption, and Ethernet protocol handling. The
cable modems also transmit data to the coaxial cable link
through a second chain of devices to achieve DES
encryption, Reed-Solomon block encoding, and Quaternary
Phase Shift Keying ("QPSK") modulation. In these
environments, a general-purpose processor is usually
required as well in order to perform initialization, statistics
collection, diagnostics, and network management functions.

The ASIC approach to media processing has three
fundamental flaws: cost, complexity, and rigidity. The
combined silicon area of all the specialized ASIC devices
required in the cable modem, for example, results in a
component cost incompatible with the per subscriber price
target for a cable service. The cable plant itself is a very
hostile service environment, with noise ingress, reflections,
nonlinear amplifiers, and other channel impairments,
especially when viewed in the upstream direction. Telephony
modems have developed an elaborate hierarchy of
algorithms implemented in DSP software, with automatic
reduction of data rates from 28.8 Kbits/sec to 19.6 Kbits/sec,
14.4 Kbits/sec, or much lower rates as needed to
accommodate noise, echoes, and other impairments in the copper
plant. To implement similar algorithms on an ASIC-based
broadband modem is far more complex to achieve in
software.

These problems of cost, complexity, and rigidity are
compounded further in more complete broadband devices
such as digital set-top boxes, multimedia PCs, or video
conferencing equipment, all of which go beyond the basic
radio frequency ("RF") modem functions to include a broad
range of audio and video compression and decoding
algorithms, along with remote control and graphical user
interfaces. Software for these devices must control what
amounts to a heterogeneous multi-processor, where each
specialized processor has a different, and usually eccentric
or primitive, programming environment. Even if these
programming environments are mastered, the degree of
programmability is limited. For example, Motion Picture Expert
Group-I ("MPEG-I") chips manufactured by AT&T
Corporation will not implement advances such as fractal- and
wavelet-based compression algorithms, nor are these chips
readily software upgradeable to the MPEG-II standard.
A broadband network operator who leases an MPEG
ASIC-based product is therefore at risk of having to continuously
upgrade his system by purchasing significant amounts of
new hardware just to track the evolution of MPEG
standards.

The high cost of ASIC-based media processing results
from inefficiencies in both memory and logic. A typical
ASIC consists of a multiplicity of specialized logic blocks,
each with a small memory dedicated to holding the data
which comprises the working set for that block. The silicon
area of these multiple small memories is further increased by
the overhead of multiple decoders, sense amplifiers, write
drivers, etc. required for each logic block. The logic blocks
are also constrained to operate at frequencies determined by
the internal symbol rates of broadband algorithms in order to
avoid additional buffer memories. These frequencies
typically differ from the optimum speed-area operating point of
a given semiconductor technology. Interconnect and
synchronization of the many logic and memory blocks are also
major sources of overhead in the ASIC approach.

The disadvantages of the prior ASIC approach can be
overcome by a single unified media processor. The cost
advantages of such a unified processor can be achieved by
gathering all the many ASIC functions of a broadband media
product into a single integrated circuit. Cost reduction is
further increased by reducing the total memory area of such
a circuit by replacing the multiplicity of small ASIC
memories with a single memory hierarchy large enough to
accommodate the sum total of all the working sets, and wide
enough to supply the aggregate bandwidth needs of all the
logic blocks. Additionally, the logic block interconnect
circuitry to this memory hierarchy may be streamlined by
providing a generally programmable switching fabric. Many
of the logic blocks themselves can also be replaced with a
single multi-precision arithmetic unit, which can be
internally partitioned under software control to perform addition,
multiplication, division, and other integer and floating point
arithmetic operations on symbol streams of varying widths,
while sustaining the full data throughput of the memory
hierarchy. The residue of logic blocks that perform
operations that are neither arithmetic nor permutation group
oriented can be replaced with an extended math unit that
supports additional arithmetic operations such as finite field,
ring, and table lookup, while also sustaining the full data
throughput of the memory hierarchy.
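
The idea of an arithmetic unit that is partitioned under software control can be illustrated with a small software model. This is a sketch under stated assumptions, not the hardware described here: the function name, the 128-bit default register width, and the choice of wrap-around (modular) addition within each symbol are all assumptions made for illustration.

```python
def partitioned_add(a, b, symbol_bits, register_bits=128):
    """Add two registers lane by lane, with no carry between lanes.

    a and b are non-negative integers holding packed symbols. The register
    is treated as register_bits // symbol_bits independent lanes, and each
    lane sum wraps around modulo 2**symbol_bits (an assumed overflow rule).
    """
    if symbol_bits < 1 or register_bits % symbol_bits != 0:
        raise ValueError("symbol width must evenly divide the register width")
    mask = (1 << symbol_bits) - 1
    result = 0
    for shift in range(0, register_bits, symbol_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        # Masking discards the carry so it cannot spill into the next lane.
        result |= (lane_sum & mask) << shift
    return result


if __name__ == "__main__":
    # The same operands give different results for 8-bit and 16-bit symbols.
    print(hex(partitioned_add(0x00FF00FF, 0x00010001, 8, 32)))
    print(hex(partitioned_add(0x00FF00FF, 0x00010001, 16, 32)))
```

The same pair of operands produces different results depending only on the symbol width argument, which is the sense in which one unit serves symbol streams of varying widths.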

The above multi-precision arithmetic, permutation
switch, and extended math operations can then be organized
as machine instructions that transfer their operands to and
from a single wide multi-ported register file. These
instructions can be further supplemented with load/store
instructions that transfer register data to and from a data
buffer/cache static random access memory ("SRAM") and main
memory dynamic random access memories ("DRAMs"),
and with branch instructions that control the flow of
instructions executed from an instruction buffer/cache SRAM.
Extensions to the load/store instructions can be made for
synchronization, and to branch instructions for protected
gateways, so that multiple threads of execution for audio,
video, radio, encryption, networking, etc. can efficiently and
securely share memory and logic resources of a unified
machine operating near the optimum speed-area point of the
target semiconductor process. The data path for such a
unified media processor can interface to a high speed
input/output ("I/O") subsystem that moves media streams
across ultra-high bandwidth interfaces to external storage
and I/O.

Such a device would incorporate all of the processing 
capabilities of the specialized multi-ASIC combination into 
a single, unified processing device. The unified processor 
would be agile and capable of reprogramming through the 
transmission of new programs over the communication 
medium. This programmable, general purpose device is thus 
less costly than the specialized processor combination, 
easier to operate and reprogram and can be installed or 
applied in many differing devices and situations. The device 
may also be scalable to communications applications that
support vast numbers of users through massively parallel 
distributed computing. 

It is therefore an object of this invention to process media 
data streams by executing operations at very high bandwidth 
rates. 

It is also an object of this invention to unify the audio, 
video, radio, graphics, encryption, authentication, and net- 
working protocols into a single instruction stream. 

It is also an object of this invention to achieve high 
bandwidth rates in a unified processor that is easy to 
program and more flexible than a heterogeneous
combination of special purpose processors.

It is a further object of the invention to support high level 
mathematical processing in a unified media processor, 
including finite group, finite field, finite ring and table
look-up operations, all at high bandwidth rates. 

It is yet a further object of the invention to provide a 
unified media processor that can be replicated into a
multi-processor system to support a vast array of users.

It is yet another object of this invention to allow for 
massively parallel systems within the switching fabric to 
support very large numbers of subscribers and services. 

It is also an object of the invention to provide a general 
purpose programmable processor that could be employed at
all points in a network. 

It is a further object of this invention to sustain very high 
bandwidth rates to arbitrarily large memory and input/output
systems. 
SUMMARY OF THE INVENTION

In view of the above, there is provided a system for media
processing that maintains substantially peak data throughput
in the execution and transmission of multiple media data
streams. The system includes in one aspect a general
purpose, programmable media processor, and in another
aspect includes a method for receiving, processing and
transmitting media data streams. The general purpose,
programmable media processor of the invention further
includes an execution unit, high bandwidth external
interface, and can be employed in a parallel multi-processor
system.

According to the apparatus of the invention, an execution
unit is provided that maintains substantially peak data
throughput in the unified execution of multiple media data
streams. The execution unit includes a data path, and a
multi-precision arithmetic unit coupled to the data path and
capable of dynamic partitioning based on the elemental
width of data received from the data path. The execution unit
also includes a switch coupled to the data path that is
programmable to manipulate data received from the data
path and provide data streams to the data path. An extended
mathematical element is also provided, which is coupled to
the data path and programmable to implement additional
mathematical operations at substantially peak data
throughput. In a preferred embodiment of the execution unit, at least
one register file is coupled to the data path.

According to another aspect of the invention, a general
purpose programmable media processor is provided having
an instruction path and a data path to digitally process a
plurality of media data streams. The media processor
includes a high bandwidth external interface operable to
receive a plurality of data of various sizes from an external
source and communicate the received data over the data path
at a rate that maintains substantially peak operation of the
media processor. At least one register file is included, which
is configurable to receive and store data from the data path
and to communicate the stored data to the data path. A
multi-precision execution unit is coupled to the data path
and is dynamically configurable to partition data received
from the data path to account for the elemental symbol size
of the plurality of media streams, and is programmable to
operate on the data to generate a unified symbol output to the
data path.

According to the preferred embodiment of the media
processor, means are included for moving data between
registers and memory by performing load and store
operations, and for coordinating the sharing of data among
a plurality of tasks by performing synchronization
operations based upon instructions and data received by the
execution unit. Means are also provided for securely
controlling the sequence of execution by performing branch and
gateway operations based upon instructions and data
received by the execution unit. A memory management unit
operable to retrieve data and instructions for timely and
secure communication over the data path and instruction
path respectively is also preferably included in the media
processor. The preferred embodiment also includes a
combined instruction cache and buffer that is dynamically
allocated between cache space and buffer space to ensure
real-time execution of multiple media instruction streams,
and a combined data cache and buffer that is dynamically
allocated between cache space and buffer space to ensure
real-time response for multiple media data streams.

In another aspect of the invention, a high bandwidth
processor interface for receiving and transmitting a media
stream is provided having a data path operable to transmit
media information at sustained peak rates. The high
bandwidth processor interface includes a plurality of memory
controllers coupled in series to communicate stored media
information to and from the data path, and a plurality of
memory elements coupled in parallel to each of the plurality
of memory controllers for storing and retrieving the media
information. In the preferred embodiment of the high
bandwidth processor interface, the plurality of memory
controllers each comprise a paired link disposed between each
memory controller, where the paired links each transmit and
receive plural bits of data and have differential data inputs
and outputs and a differential clock signal.

Yet another aspect of the invention includes a system for
unified media processing having a plurality of general
purpose media processors, where each media processor is
operable at substantially peak data rates and has a
dynamically partitioned execution unit and a high bandwidth
interface for communicating to memory and input/output
elements to supply data to the media processor at substantially
peak rates. A bi-directional communication fabric is
provided, to which the plurality of media processors are
coupled, to transmit and receive at least one media stream
comprising presentation, transmission, and storage media
information. The bi-directional communication fabric
preferably comprises a fiber optic network, and a subset of the
plurality of media processors comprise network servers.

According to yet another aspect of the invention, a
parallel multimedia processor system is provided having a
data path and a high bandwidth external interface coupled to
the data path and operable to receive a plurality of data of
various sizes from an external source and communicate the
received data at a rate that maintains substantially peak
operation of the parallel multi-processor system. A plurality
of register files, each having at least one register coupled to
the data path and operable to store data, are also included. At
least one multi-precision execution unit is coupled to the
data path and is dynamically configurable to partition data
received from the data path to account for the elemental
symbol size of the plurality of media streams, and is
programmable to operate in parallel on data stored in the
plurality of register files to generate a unified symbol output
for each register file.

According to the method of the invention, unified streams 
of media data are processed by receiving a stream of unified 
media data including presentation, transmission and storage 
information. The unified stream of media data is dynami- 
cally partitioned into component fields of at least one bit 
based on the elemental symbol size of data received. The 
unified stream of media data is then processed at substan- 
tially peak operation. 
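
Partitioning a unit of stream data into component fields by symbol size can be sketched as follows. The function name, the 128-bit default word width, and the least-significant-field-first ordering are assumptions made for illustration; the text does not specify them.

```python
def partition_fields(word, symbol_bits, word_bits=128):
    """Split one data word into equal-width component fields.

    word is a non-negative integer of word_bits bits. symbol_bits is the
    elemental symbol size, at least one bit, and must evenly divide
    word_bits. Fields are returned least significant first (an assumption).
    """
    if symbol_bits < 1 or word_bits % symbol_bits != 0:
        raise ValueError("symbol size must evenly divide the word width")
    mask = (1 << symbol_bits) - 1
    return [(word >> shift) & mask for shift in range(0, word_bits, symbol_bits)]


if __name__ == "__main__":
    # The same 16-bit word viewed as four 4-bit fields or two 8-bit fields.
    print([hex(f) for f in partition_fields(0xABCD, 4, 16)])
    print([hex(f) for f in partition_fields(0xABCD, 8, 16)])
```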

In one aspect of the invention, the unified stream of media 
data is processed by storing the stream of unified media data 
in a general register file. Multi-precision arithmetic opera- 
tions can then be performed on the stored stream of unified 
media data based on programmed instructions, where the 
multi-precision arithmetic operations include Boolean, inte- 
ger and floating point mathematical operations. The com- 
ponent fields of unified media data can then be manipulated 
based on programmed instructions that implement copying, 
shifting and re-sizing operations. Multi-precision math- 
ematical operations can also be performed on the stored 
stream of unified media data based on programmed 
instructions, where the mathematical operations include 
finite group, finite field, finite ring and table look-up opera- 
tions. Instruction and data pre-fetching are included to fill 
instruction and data pipelines, and memory management 
operations can be performed to retrieve instructions and data 
from external memory. The instructions and data are pref- 
erably stored in instruction and data cache/buffers, in which 
buffer storage in the instruction and data cache/buffers is 
dynamically allocated to ensure real-time execution. 
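
As a hedged illustration of a multi-precision integer operation, the sketch below adds two register-sized words field by field, so that a carry never crosses from one component field into the next. The name `group_add`, the 128-bit default width and the wrap-around (modulo) behaviour on overflow are assumptions chosen for the example; the specification does not define them here.

```python
def group_add(a: int, b: int, symbol_bits: int, width: int = 128) -> int:
    """Add corresponding fields of a and b modulo 2**symbol_bits.

    Each field is summed independently, so no carry propagates between fields.
    """
    if symbol_bits < 1 or width % symbol_bits != 0:
        raise ValueError("symbol size must be at least one bit and divide the width")
    mask = (1 << symbol_bits) - 1
    result = 0
    for shift in range(0, width, symbol_bits):
        total = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (total & mask) << shift  # discard the carry out of this field
    return result


# One pair of operands, three precisions, three different results.
as_bytes = group_add(0xFFFFFFFF, 0x00000001, 8, width=32)
as_halves = group_add(0xFFFFFFFF, 0x00000001, 16, width=32)
as_word = group_add(0xFFFFFFFF, 0x00000001, 32, width=32)
```

The operands are identical in all three calls; only the symbol size changes, and with it the point at which the carry is cut off.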

Other aspects of the invention include a method for 
achieving high bandwidth communications between a gen- 
eral purpose media processor and external devices by pro- 
viding a high bandwidth interface disposed between the 
media processor and the external devices, in which the high 
bandwidth interface comprises at least one uni-directional 
channel pair having an input port and an output port. A 
plurality of media data streams, comprising component 
fields of various sizes, are transmitted and received between 
the media processor and the external devices at a rate that 
sustains substantially peak data throughput at the media 
processor. A method for processing streams of media data is 
also included that provides a bi-directional communications 
fabric for transmitting and receiving at least one stream of 
media data, where the at least one stream of media data 
comprises presentation, transmission and storage informa- 
tion. At least one programmable media processor is provided 
within the communications network for receiving, process- 
ing and transmitting the at least one stream of unified media 
data over the bidirectional communications fabric. 

The general purpose, programmable media processor of 
the invention combines in a single device all of the necessary 
hardware included in the specialized processor combina- 
tions to process and communicate digital media data streams 
in real-time. The general purpose, programmable media 
processor is therefore cheaper and more flexible than the 
prior approach to media processing. The general purpose, 
programmable media processor is thus more susceptible to 
incorporation within a massively parallel processing net- 
work of general purpose media processors that enhance the 
ability to provide real-time multi-media communications to 
the masses. 

These features are accomplished by deploying server 
media processors and client media processors throughout the 
network. Such a network provides a seamless, global media 
super-computer which allows programmers and network 
owners to virtualize resources. Rather than restrictively 
accessing only the memory space and processing time of a 
local resource, the system allows access to resources 
throughout the network. In small access points such as 
wireless devices, where very little memory and processing 
logic is available due to limited battery life, the system is 
able to draw upon the resources of a homogeneous multi- 
computer system. 

The invention also allows network owners the facility to 
track standards and to deploy new services by broadcasting 
software across the network rather than by instituting costly 
hardware upgrades across the whole network. Broadcasting 
software across the network can be performed at the end of 
an advertisement or other program that is broadcast 
nationally. Thus, services can be advertised and then trans- 
mitted to new subscribers at the end of the advertisement. 

These and other features and advantages of the invention 
will be apparent upon consideration of the following 
detailed description of the presently preferred embodiments 
of the invention, taken in conjunction with the appended 
drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 
FIG. 1 is a block diagram of a broad band media computer 
employing the general purpose, programmable media 
processor of the invention; 

FIG. 2 is a block diagram of a global media processor 
employing multiple general purpose media processors 
according to the invention; 

FIG. 3 is an illustration of the digital bandwidth spectrum 
for telecommunications, media and computing 
communications; 

FIG. 4 is the digital bandwidth spectrum shown in FIG. 3 
taking into account the bandwidth overhead associated with 
compressed video techniques; 

FIG. 5 is a block diagram of the current specialized 
processor solution for mass media communication, where 
FIG. 5 shows the current distributed system, and shows a 
possible integrated approach; 

FIG. 6 is a block diagram of two presently preferred 
general purpose media processors, where FIG. 6 shows a 
distributed system and shows an integrated media processor; 

FIG. 7 is a block diagram of the presently preferred 
structure of a general purpose, programmable media 
processor according to the invention; 

FIG. 8 is a drawing consisting of visual illustrations of the 
various group operations provided on the media processor, 
where FIG. 8(a) illustrates the group expand operation, FIG. 
8(b) illustrates the group compress or extract operation, FIG. 
8(c) illustrates the group deal and shuffle operations, FIG. 
8(d) illustrates the group swizzle operation and FIG. 8(e) 
illustrates the various group permute operations; 

FIG. 9 shows the preferred instruction and data sizes for 
the general purpose, programmable media processor, where 
FIG. 9(a) is an illustration of the various instruction formats 
available on the general purpose, programmable media 
processor, FIG. 9(b) illustrates the various floating-point 
data sizes available on the general purpose media processor, 
and FIG. 9(c) illustrates the various fixed-point data sizes 
available on the general purpose media processor; 

FIG. 10 is an illustration of a presently preferred memory 
management unit included in the general purpose processor 
shown in FIG. 7, where FIG. 10(a) is a translation block 
diagram and FIG. 10(b) illustrates the functional blocks of 
the translation lookaside buffer; 

FIG. 11 is an illustration of a super-string pipeline 
technique; 

FIG. 12 is an illustration of the presently preferred 
super-spring pipeline technique; 

FIG. 13 is a block diagram of a single memory channel for 
communication to the general purpose media processor 
shown in FIG. 7; 

FIG. 14 is an illustration of the connection of standard 
memory devices to the preferred memory interface; 

FIG. 15 is a block diagram of the input/output controller 
for use with the memory channel shown in FIG. 13; 

FIG. 16 is a block diagram showing multiple memory 
channels connected to the general purpose media processor 
shown in FIG. 7, where FIG. 16(a) shows a two-channel 
implementation and FIG. 16(b) illustrates a twelve-channel 
embodiment; 

FIG. 17 illustrates the presently preferred packet 
communications protocol for use over the memory channel shown 
in FIG. 13; 

FIG. 18 shows a multi-processor configuration employing 
the general purpose media processor shown in FIG. 7, where 
FIG. 18(a) shows a linear processor configuration, FIG. 
18(b) shows a processor ring configuration, and FIG. 18(c) 
shows a two-dimensional processor configuration; and 

FIG. 19 shows a presently preferred multi-chip 
implementation of the general purpose, programmable media 
processor of the invention. 

DETAILED DESCRIPTION OF THE 
PRESENTLY PREFERRED EMBODIMENTS 

Referring to the drawings, where like-reference numerals 
refer to like elements throughout, a broad band 
microcomputer 10 is provided in FIG. 1. The broad band 
microcomputer 10 consists essentially of a general purpose media 
processor 12. As will be described in more detail below, the 
general purpose media processor 12 receives, processes and 
transmits media data streams in a bidirectional manner from 
upstream network components to downstream devices. In 
general, media data streams received from upstream 
network components can comprise any combination of audio, 
video, radio, graphics, encryption, authentication, and 
networking information. As those skilled in the art will 
appreciate, however, the general purpose media processor 
12 is in no way limited to receiving, processing and 
transmitting only these types of media information. The general 
purpose media processor 12 of the invention is capable of 
processing any form of digital media information without 
departing from the spirit and essential scope of the 
invention. 

System Configuration 

In the preferred embodiment of the invention shown in 
FIG. 1, media data streams are communicated to the media 
processor 12 from several sources. Ideally, unified media 
data streams are received and transmitted by the general 
purpose media processor 12 over a fiber optic cable network 
14. As will be described in more detail below, although a 
fiber optic cable network is preferred, the presently existing 
communications network in the United States consists of a 
combination of fiber optic cable, coaxial cable and other 
transmission media. Consequently, the general purpose 
media processor 12 can also receive and transmit media data 
streams over coaxial cable 14 and traditional twisted pair 
wire connections 16. The specific communications protocol 
employed over the twisted pair 16, whether POTS, ISDN or 
ADSL, is not essential; all protocols are supported by the 
broad band microcomputer 10. The details of these protocols 
are generally known to those skilled in the art and no further 
discussion is therefore needed or provided herein. 

Another form of upstream network communication is 
through a satellite link 18. The satellite link 18 is typically 
connected to a satellite receiver 20. The satellite receiver 20 
comprises an antenna, usually in the form of a satellite dish, 
and [illegible] circuitry. The details of such satellite 
communications are also generally known in the art, and 
further detail is therefore not provided or included herein. 

As described above, the general purpose media processor 
12 communicates in a bidirectional manner to receive, 
process and transmit media data streams to and from 
downstream devices. As shown in FIG. 1, downstream 
communication preferably takes place in at least two forms. First, 
media data streams can be communicated over a 
bi-directional local network 22. Various types of local 
networks 22 are generally known in the art and many different 
forms exist. The general purpose media processor 12 is 
capable of communicating over any of these local networks 
22 and the particular type of network selected is 
implementation specific. 

The local network 22 is preferably employed to 
communicate between the unified processor 12, and audio/visual 
devices 24 or other digital devices 26. Presently preferred 
examples of audio/visual devices 24 include digital cable 
television, video-on-demand devices, electronic yellow 
pages services, integrated message systems, video 
telephones, video games and electronic program guides. As 
those skilled in the art will appreciate, other forms of 
audio/video devices are contemplated within the spirit and 
scope of the invention. Present preferred embodiments of 
other digital devices 26 for communication with the general 
purpose media processor 12 include personal computers, 
television sets, work stations, digital video camera 
recorders, and compact disc read-only memories. As those 
skilled in the art will also appreciate, further digital devices 
26 are contemplated for communication to the general 
purpose media processor 12 without departing from the 
spirit and scope of the invention. 

Second, the general purpose media processor preferably 
also communicates with [illegible] 
network 28. In the presently preferred embodiment of the 
invention, wireless devices for communication over the 
wireless network 28 can comprise either remote 
communication devices 30 or remote computing devices 32. Presently 
preferred embodiments of the remote communications 
devices 30 include cordless telephones and personal 
communicators. Presently preferred embodiments of the remote 
computing devices 32 include remote controls and 
telecommunicating devices. As those skilled in the art will 
appreciate, other forms of remote communication devices 30 
and remote computing devices 32 are capable of 
communication with the general purpose media processor 12 without 
departing from the spirit and scope of the invention. An agile 
digital radio (not shown) that incorporates a general purpose 
media processor 12 may be used to communicate with these 
wireless devices. 

Network Configuration 

Referring now to FIG. 2, the general purpose media 
processor 12 is preferably disposed throughout a digital 
communications network 38. In order to enable 
communication among large and small businesses, residential 
customers and mobile users, the network 38 can consist of a 
combination of many individual subnetworks comprised of 
three main forms of interconnection. The trunk and main 
branches of the network 38 preferably employ fiber optic 
cable 40 as the preferred means of interconnection. Fiber 
optic cable 40 is used to connect between general purpose 
media processors 12 disposed as network servers 46 or large 
business installations 48 that are capable of coupling directly 
to the fiber optic link 40. For communications to small 
business and residential customers that may be incapable of 
directly coupling to the fiber optic cable 40, a general 
purpose media processor 12 can be used as an interface to 
other forms of network interconnection. 

As shown in FIG. 2, alternate forms of interconnection 
consist of coaxial cable lines 42 and twisted pair wiring 44. 
Coaxial cable lines are currently in place throughout the 
U.S. and are typically employed to provide cable television 
services to residential homes. According to the preferred 
embodiment of the invention, general purpose media 
processors 12 can be installed at these residential locations 52. 
In contrast to the specialized processor approach, the general 
purpose media processor 12 provides enough bandwidth to 
allow for bi-directional communications to and from these 
residential locations 52. 

Network servers 46 controlled by general purpose media 
processors 12 are also employed throughout the network 38. 
For example, the network servers 46 can be used to interface 
between the fiber optic network 40 and twisted pair wiring 
44. Twisted pair wiring 44 is still employed for small 
businesses 50 and residential locations 52 that do not or 
cannot currently subscribe to coaxial cable or fiber optic 
network [illegible] media processors 12 are 
also disposed at these small business locations 50 and 
[illegible] 
processors 12 are also installed in wireless or mobile 
locations [illegible] network 38 
[illegible]. As shown in FIG. 2, 
[illegible] peripherals 56 can also be coupled to general 
[illegible] network 38. 

The general purpose media processor 12 is operable at 
significantly high bandwidths in order to receive, process 
and transmit unified media data streams. Referring to FIG. 
3, the respective frequencies for various types of media data 
streams are set forth against a bandwidth spectrum 60. The 
bandwidth spectrum 60 includes three component 
spectra, all along the same range of frequencies, which 
represent the various frequency rates of digital media 
communications. Current computing bandwidth capabilities are 
[illegible] telecommunications spectrum 62 shows 
the various frequency bands used for telecommunications 
transmission. For example, teletype terminals and modems 
[illegible] bits/second to 
[illegible] bits/second. The ISDN telecommunication protocol 
[illegible]. At the upper end of the 
telecommunications spectrum 62, T1 and T3 trunks operate 
[illegible] megabits per second, 
[illegible]. The SONET frequency range extends from 
approximately 128 megabits per second up to approximately 
[illegible] gigabits per second. Accordingly, in order to carry such 
broad band communications, the general purpose media 
processor 12 is capable of transferring information at rates 
into the gigabits per second range or higher. 

A spectrum of typical media data streams is presented in 
the media spectrum 64 shown in FIG. 3. Voice and music 
transmissions are centered at frequencies of approximately 
64 kilobits per second and one megabit per second, 
respectively. At the upper end of the media spectrum 64, video 
transmission takes place in a range from 128 megabits per 
second for high density television up to over 256 gigabits per 
second for movie applications. When using common video 
compression techniques, however, the video transmission 
spectrum can be shifted down to between 32 kilobits per 
second to 128 megabits per second as a result of the data 
compression. As described below, the processing required to 
achieve the data compression results in an increase in 
bandwidth requirements. 

Current computing bandwidths are shown in the 
computing spectrum 66 of FIG. 3. Serial communications presently 
take place in a range between two kilobits per second up to 
512 kilobits per second. The Ethernet network protocol 
operates at approximately 8 megabits per second. Current 
dynamic random access memory and other digital input/ 
output peripherals operate between 32 megabits per second 
and 512 megabits per second. Presently available 
microprocessors are capable of operation in the low gigabits per 
second range. For example, the '386 Pentium 
microprocessor manufactured by Intel Corporation of Santa Clara, Calif. 
operates in the lower half of that range, and the Alpha 
microprocessor manufactured by Digital Equipment 
Corporation approaches the 16 gigabits per second range. 

When video compression is employed, as expressed 
above, the associated processing overhead reduces the 
effective bandwidth of the particular processor. As a result, in 
order to handle compressed video, these processors must 
operate in the terahertz frequency range. The bandwidth 
spectrum 60 shown in FIG. 4 represents the effect of 
handling media data streams including compressed video. 
The computing spectrum 66 is skewed down to properly 
align the computing bandwidth requirements with the 
telecommunications spectrum 62 and the media spectrum 64. 
Accordingly, current processor technology is not sufficient 
to handle the transmission and processing associated with 
complex streams of multi-media data. 

The current specialized processor approach to media 
processing is illustrated in the block diagram shown in FIG. 
5. As shown in FIG. 5, special purpose processors are 
coupled to a back plane 70, which is capable of transmitting 
instructions and data at the upper kilobits to lower gigabits 
per second range. In a typical configuration, an audio 
processor 76, video processor 78, graphics processor 80 and 
network processor 82 are all coupled to the back plane 70. 
Each of the audio, video, graphics and network processors 
76-82 typically employ their own private or dedicated 
memories 84, which are only accessible to the specific 
processor and not accessible over the back plane 70. As 
described above, however, unless video data streams are 
constantly being processed, for example, the video processor 
78 will sit idle for periods of time. The computing power of 
the dedicated video processor 78 is thus only available to 
handle video data streams and is not available to handle 
other media data streams that are directed to other dedicated 
processors. This, of course, is an inefficient use of the video 
processor 78 particularly in view of the overall processing 
capability of this multi-processor system. 

The general purpose media processor 12, in contrast, 
handles a data stream of audio, video, graphics and network 
information all at the same time with the same processor. In 
order to handle the ever changing combination of data types, 
the general purpose media processor 12 is dynamically 
partitionable to allocate the appropriate amount of 
processing for each combination of media in a unified media data 
stream. A block diagram of two preferred general purpose 
media processor system configurations is shown in FIG. 6. 
Referring to FIG. 6, a general purpose media processor 12 
is coupled to a high-speed back plane 90. The presently 
preferred back plane 90 is capable of operation at 30 gigabits 
per second. As those skilled in the art will appreciate, back 
planes 90 that are capable of operation at 400 gigabits per 
second or greater bandwidth are envisioned within the spirit 
and scope of the invention. Multiple memory devices 92 are 
also coupled to the back plane 90, which are accessible by 
the general purpose media processor 12. Input/output 
devices 94 are coupled to the back plane 90 through a 
dual-ported memory 92. The configuration of the input/ 
output devices 94 on one end of the dual-ported memory 92 
allows the sharing of these memory devices 92 throughout 
a network 38 of general purpose media processors 12. 

Alternatively, FIG. 6 shows a presently preferred 
integrated general purpose media processor 12. The integrated 
processor includes on-board memory and I/O 86. The 
on-board memory is preferably of sufficient size to optimize 
throughput, and can comprise a cache and/or buffer memory 
or the like. The integrated media processor 12 also connects 
to external memory 88, which is preferably larger than the 
on-board memory 86 and forms the system main memory. 

Execution Unit 

One presently preferred embodiment of an integrated 
general purpose media processor 12 is shown in FIG. 7. The 
core of the integrated general purpose media processor 12 
comprises an execution unit 100. Three main elements or 
subsections are included in the execution unit 100. A 
multiple precision arithmetic/logic unit ("ALU") 102 performs 
all logical and simple arithmetic operations on incoming 
media data streams. Such operations consist of calculate and 
control operations such as Boolean functions, as well as 
addition, subtraction, multiplication and division. These 
operations are performed on single or unified media data 
streams transmitted to and from the multiple precision ALU 
102 over a data path 108. The data path 108 is 128 bits wide, 
although those skilled in the art will appreciate that the data 
path 108 can take on any width or size without departing 
from the spirit and scope of the invention. The wider the 
data path 108, the more unified media data can be processed 
by the general purpose media processor 12. 

Coupled to the multi-precision ALU 102 via the data path 
108, and also an element of the execution unit 100, is a 
programmable switch 104. The programmable switch 104 
performs data handling operations on single or unified media 
data streams transmitted over the data path 108. Examples of 
such data handling operations include deals, shuffles, shifts, 
expands, compresses, swizzles, permutes and reverses, 
although other data handling operations are contemplated. 
Operations can be performed on single bits or bit fields 
consisting of two or more bits up to the entire width of the 
data path 108. Thus, single bits or bit fields of various sizes 
can be manipulated by programmable operation of the 
switch 104. 

Examples of the presently preferred data manipulation 
operations performed by the general purpose media 
processor 12 are shown in FIG. 8. A group expand operation is 
[illegible] 
sequential field of bits 278 can be 
[illegible] 
into a larger field array 274. The reverse of the group expand 
operation is the group compress or extract operation. A visual 
[illegible] 
to form a contiguous or sequential field of bits [illegible]. 

Referring to FIGS. 8(c)-8(e), group deal, shuffle, swizzle 
and permute operations performed by the programmable 
switch 104 are also illustrated. The operations performed by 
these instructions are readily understood from a review of 
the drawings. The group manipulation operations illustrated 
in FIGS. 8(a)-8(e) comprise the presently contemplated data 
manipulation operations for the general purpose media 
processor 12. As those skilled in the art will appreciate, either 
a subset of these operations or additional data manipulation 
operations can be incorporated in other alternate 
embodiments of the general purpose media processor 12 without 
departing from the spirit and scope of the invention. 

In addition to the mathematical operations performed by 
the multi-precision ALU 102, the general purpose media 
processor 12 includes an extended math element 106, which 
is coupled to the data path 108 and also comprises part of 
the execution unit 100. The extended math element 106 
performs [illegible] arithmetic operations [illegible]. One 
example of an extended operation comprises a Galois field 
operation. Other examples of extended mathematical 
functions performed by the extended math element 106 include 
CRC generation and checking, Reed-Solomon code 
generation and checking, and spread-spectrum encoding and 
decoding. As those skilled in the art appreciate, additional 
mathematical operations are possible and contemplated. 
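
To make the notion of a Galois field operation concrete, the following sketch multiplies two elements of GF(2^8) in software using shift-and-add with polynomial reduction. It is a sketch under stated assumptions: the specification does not name a field size or a reduction polynomial at this point, so the 8-bit field and the polynomial 0x11B (x^8 + x^4 + x^3 + x + 1) are chosen purely for illustration, and `gf_mul` is a hypothetical name.

```python
def gf_mul(a: int, b: int, poly: int = 0x11B, bits: int = 8) -> int:
    """Multiply a and b as polynomials over GF(2), reduced modulo poly.

    Both operands must already be field elements, i.e. below 2**bits.
    Addition in the field is XOR, so there are no carries between bit positions.
    """
    if not (0 <= a < (1 << bits) and 0 <= b < (1 << bits)):
        raise ValueError("operands must be field elements")
    result = 0
    for _ in range(bits):
        if b & 1:
            result ^= a          # add the current multiple of a
        b >>= 1
        a <<= 1                  # multiply a by x
        if a & (1 << bits):
            a ^= poly            # reduce when the degree reaches 'bits'
    return result


product = gf_mul(0x57, 0x83)
```

Because every step is a shift, a test and an XOR, the same loop structure underlies bit-serial CRC generation and the symbol arithmetic used by Reed-Solomon codes, which is one reason these functions are naturally grouped together.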

According to the preferred embodiment of the integrated 
general purpose media processor 12, a register file 110 is 
provided in addition to the execution unit 100 to process 
media data. The register file 110 stores and transmits data 
streams to and from the execution unit 100 via the data path 
108. Rather than employing a complex set of specific or 
dedicated registers, the general purpose media processor 12 
preferably includes 64 general purpose registers in the 
register file 110 along with one program counter (not 
shown). The 64 general purpose registers contained in the 
register file 110 are all available to the user/programmer, and 
comprise a portion of the user state of the general purpose 
media processor 12. The general purpose registers are 
preferably capable of storing any form of data. Each register 
within the register file 110 is coupled to the data path 108 
and is accessible to the execution unit 100 in the same 
manner. Thus, the user can employ a general purpose 
register according to the specific needs of a particular 
program or unique application. As those skilled in the art 
will appreciate, the register file 110 can also comprise a 
plurality of register files 110 configured in parallel in order 
to support parallel multi-threaded processing. 

Instruction Set and User Programming 

Control or manipulation of data processed by the general 
purpose media processor 12 is achieved by selected 
instructions programmed by the user. Those skilled in the art will 
appreciate that a great number of programs are possible 
through various sequences of instructions. Particular 
programs can be developed for each unique implementation of 
the general purpose media processor 12. A detailed 
discussion of such specific programs is therefore beyond the scope 
of this description. 

One presently preferred instruction set for the general 
purpose media processor 12 is included in the Microfiche 
Appendix, the contents of which are hereby incorporated 
herein by reference. A list of the presently preferred major 
operation codes for the general purpose media processor 12 
appears below in Table I. 

TABLE I 

MAJOR OPERATION CODES 

MAJOR  0               32              64           96            128      160        192       224
0      ERES            GSHUFFLEI       FMULADD16    GMULADD1      LU16LAI  SAAS64LAI  EADDIO    BFE16
1      ESHUFFLEI4MUX   GSHUFFLEI4MUX   FMULADD32    GMULADD2      LU16BAI  SAAS64BAI  EADDIUO   BFNUE16
2      -               GSELECT8        FMULADD64    GMULADD4      LU16LI   SCAS64LAI  ESETIL    BFNUGE16
3      EMDEPI          GMDEPI          -            GMULADD8      LU16BI   SCAS64BAI  ESETIGE   BFNUL16
4      EMUX            GMUX            FMULSUB16    GMULADD16     LU32LAI  SMAS64LAI  ESETIE    BFE32
5      E8MUX           G8MUX           FMULSUB32    GMULADD32     LU32BAI  SMAS64BAI  ESETINE   BFNUE32
6      EGFMUL64        GGFMUL8         FMULSUB64    GMULADD64     LU32LI   SMUX64LAI  ESETIUL   BFNUGE32
7      ETRANSPOSE8MUX  GTRANSPOSE8MUX  -            GEXTRACT128   LU32BI   SMUX64BAI  ESETIUGE  BFNUL32
8      -               -               -            -             L16LAI   S16LAI     ESUBIO    BFE64
9      ESWIZZLE        GSWIZZLE        -            GUMULADD2     L16BAI   S16BAI     ESUBIUO   BFNUE64
10     -               GSWIZZLECOPY    -            GUMULADD4     L16LI    S16LI      ESUBIL    BFNUGE64
11     -               GSWIZZLESWAP    -            GUMULADD8     L16BI    S16BI      ESUBIGE   BFNUL64
12     EDEPI           GDEPI           F.16         GUMULADD16    L32LAI   S32LAI     ESUBIE    BFE128
13     EUDEPI          GUDEPI          F.32         GUMULADD32    L32BAI   S32BAI     ESUBINE   BFNUE128
14     EWTHI           GWTHI           F.64         GUMULADD64    L32LI    S32LI      ESUBIUL   BFNUGE128
15     EUWTHI          GUWTHI          -            GUEXTRACT128  L32BI    S32BI      ESUBIUGE  BFNUL128
16     -               -               GFMULADD16   GEXTRACTI     L64LAI   S64LAI     EADDI     BANDE
17     -               -               GFMULADD32   GEXTRACTI16   L64BAI   S64BAI     EXORI     BANDNE
18     -               -               GFMULADD64   GEXTRACTI32   L64LI    S64LI      EORI      BL/BLZ
19     -               -               GFMULADD128  GUEXTRACTI64  L64BI    S64BI      EANDI     BGE/BGEZ
20     -               -               GFMULSUB16   GEXTRACT      L128LAI  S128LAI    ESUBI     BE
21     -               -               GFMULSUB32   I64           L128BAI  S128BAI    -         BNE
22     -               -               GFMULSUB64   GEXTRACT      L128LI   S128LI     ENORI     BUL/BGZ
23     -               -               GFMULSUB128  I128          L128BI   S128BI     ENANDI    BUGE/BLEZ
24     -               -               -            G.1           LBI      SBI        -         BGATEI
25     -               -               -            G.2           LUBI     -          -         -
26     -               -               -            G.4           -        -          -         -
27-28  ECOPYI          CPAS            -            G.8, G.16     -        -          ECOPYI    BI
29     -               -               GF.32        G.32          -        -          -         BLINKI
30     -               -               GF.64        G.64          -        -          -         -
31     E.MINOR         -               GF.128       G.128         L.MINOR  S.MINOR    E.MINOR   B.MINOR

major operation code field values 

As shown in Table I, the major operation codes are grouped 
according to the function performed by the operations. The 
operations are thus arranged and listed above according to 
the presently preferred operation code number for each 
instruction. As many as 255 separate operations are 
contemplated for the preferred embodiment of the general 
purpose media processor 12. As shown in Table I, however, 
not all of the operation codes are presently implemented. As 
those skilled in the art will appreciate, alternate schemes for 
organizing the operation codes, as well as additional 
operation codes for the general purpose media processor 12, are 
possible. 
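
One way to read the layout of Table I is that the column heading supplies a base value and the row label an offset, so that an entry's operation code number is their sum. The sketch below encodes that reading; it is an interpretation of the table layout rather than a statement from the specification, and the names `COLUMN_BASES`, `major_opcode` and `table_position` are hypothetical.

```python
COLUMN_BASES = (0, 32, 64, 96, 128, 160, 192, 224)  # column headings of Table I
ROWS_PER_COLUMN = 32                                 # row labels 0 through 31


def major_opcode(column_base: int, row: int) -> int:
    """Return the code number for the table cell at (column heading, row label)."""
    if column_base not in COLUMN_BASES or not 0 <= row < ROWS_PER_COLUMN:
        raise ValueError("not a valid Table I position")
    return column_base + row


def table_position(opcode: int) -> tuple[int, int]:
    """Return (column heading, row label) for a code number; inverse of major_opcode()."""
    if not 0 <= opcode < len(COLUMN_BASES) * ROWS_PER_COLUMN:
        raise ValueError("code number out of range")
    column, row = divmod(opcode, ROWS_PER_COLUMN)
    return COLUMN_BASES[column], row


# Under this reading, FMULADD16 (column 64, row 0) has code number 64
# and BNE (column 224, row 21) has code number 245.
fmuladd16 = major_opcode(64, 0)
bne = major_opcode(224, 21)
```

Under this reading each column collects one functional family (for example, loads under 128 and stores under 160), which matches the statement that the codes are grouped by function.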

The instructions provided in the instruction set for the 
general purpose media processor 12 control the transfer, 
processing and manipulation of data streams between the 
register file 110 and the execution unit 100. The presently 
preferred width of the instruction path 112 is 32 bits, 
organized as four eight-bit bytes ("quadlets"). Those skilled 
in the art will appreciate, however, that the instruction path 
112 can take on any width without departing from the spirit 
and scope of the invention. Preferably, each instruction 
within the instruction set is stored or organized in memory 
on four-byte boundaries. The presently preferred format for 
instructions is shown in FIG. 9(a). 

As shown in FIG. 9(a), each of the presently preferred 
instruction formats for the general purpose media processor 
12 includes a field 280 for the major operation code number 
shown in Table I. Based on the type of operation performed, 
the remaining bits can provide additional operands accord- 
ing to the type of addressing employed with the operation. 
For example, the remainder of the 32-bit instruction field can 
comprise an immediate operand ("imm"), or operands stored 
in any of the general registers ("ra," "rb," "rc," and "rd"). In 
addition, minor operation codes 282 can also be included 
among the operands of certain 32-bit instruction formats. 

The presently preferred embodiment of the general purpose
media processor 12 includes a limited instruction set
similar to those seen in Reduced Instruction Set Computer
(RISC) systems. The preferred instruction set for the
general purpose media processor 12 shown in Table I
includes operations which implement load, store,
synchronize, branch and gateway functions. These five
groups of operations can be visually represented as two
general classes of related operations. The branch and gateway
operations perform related functions on media data
streams and are thus visually represented as block 114 in
FIG. 7. Similarly, the load, store and synchronize operations
are grouped together in block 116 and perform similar
operations on the media data streams. (Blocks 114 and 116
only represent the above classification of these operations
and their function in the processing of media data streams,
and do not indicate any specific underlying electronic
connections.) A more detailed discussion of these
operations, and the functionality of the general purpose
media processor 12, appears in the Microfiche Appendix.

The four-byte structure of instructions for the general 
purpose media processor 12 is preferably independent of the 
byte ordering used for any data structures. Nevertheless, the 
gateway instructions are specifically defined as 16-byte 
structures containing a code address used to securely invoke 
a procedure at a higher privilege level. Gateways are preferably
marked by protection information specified in the
translation lookaside buffer 148 in the memory management 
unit 122. Gateways are thus preferably aligned on 16-byte 
boundaries in the external memory. In addition to the general 
purpose registers and program counter, a privilege level 
register is provided within the register file 110 that contains 
the privilege level of the currently executing instruction. 

The instruction set preferably includes load and store
instructions that move data between memory and the register
file 110, branch instructions to compare the content of
registers and transfer control, and arithmetic operations to
perform computations on the contents of registers. Swap
instructions provide multi-thread and multi-processor
synchronization. These operations are preferably indivisible and
include such instructions as add-and-swap, compare-and-swap,
and multiplex-and-swap instructions. The fixed-point
compare-and-branch instructions within the instruction set
shown in Table I provide the necessary arithmetic tests for
equality and inequality of signed and unsigned fixed-point
values. The branch through gateway instruction provides a
secure means to access code at a higher privileged level in
a form similar to a high level language procedure call
generally known in the art.

The general purpose media processor 12 also preferably
supports floating-point compare-and-branch instructions.
The arithmetic operations, which are supported in hardware,
include floating-point addition, subtraction, multiplication,
division and square root. The general purpose media
processor 12 preferably supports other floating-point operations
defined by the ANSI-IEEE floating-point standard through
the use of software libraries. A floating-point value can
preferably be 16, 32, 64 or 128 bits wide. Examples of the
presently preferred floating-point data sizes are illustrated
in FIG. 9(b).

The general purpose media processor 12 preferably supports
virtual memory addressing and virtual machine operation
through a memory management unit 122. Referring to
FIG. 10(a), one presently preferred embodiment of the
memory management unit 122 is shown. The memory
management unit 122 preferably translates global virtual
addresses into physical addresses by software programmable
routines augmented by a hardware translation lookaside
buffer ("TLB") 148. A facility for local virtual address
translation 164 is also preferably provided. As those skilled
in the art will appreciate, the memory management unit 122
includes a data cache 166 and a tag cache 168 that store data
and tags associated with memory sections for each entry in
the TLB 148.

A block diagram of one preferred embodiment of the TLB
148 is shown in FIG. 10(b). The TLB 148 receives a virtual
address 230 as its input. For each entry in the TLB 148, the
virtual address 230 is logically AND-ed with a mask 232.
The output of each respective AND gate 234 is compared via
a comparator 236 with each entry in the TLB 148. If a match
is detected, an output from the comparator 236 is used to
gate data 240 through a transceiver 238. As those skilled in
the art will appreciate, a match indicates the entry of the
corresponding physical address within the contents of the
TLB 148 and no external memory or I/O access is required.
The data 240 for the data cache 166 (FIG. 10(a)) is then
combined with the remaining lower bits of the virtual
address 230 through an exclusive-OR gate 242. The resultant
combination is the physical address 244 output from the
TLB 148. If a match is not detected between the logical
address and the contents of the tag cache 168, an external
memory or I/O access by the memory management unit 122 is
necessary to retrieve the relevant portion of memory and
update the contents of the TLB 148 accordingly.
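
The mask, compare and exclusive-OR steps of the TLB lookup described above can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the circuit of FIG. 10(b): the names TlbEntry and tlb_lookup, the example field values, and the reading of "remaining lower bits" as the address bits not selected by the mask are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class TlbEntry:
    mask: int  # selects the virtual-address bits that take part in the match
    tag: int   # value the masked virtual address is compared against
    data: int  # translation data gated out when the entry matches


def tlb_lookup(entries: Sequence[TlbEntry], vaddr: int) -> Optional[int]:
    """Return the physical address on a hit, or None on a miss."""
    for entry in entries:
        # AND the virtual address with the mask, then compare with the entry.
        if (vaddr & entry.mask) == entry.tag:
            # Assumption: the "remaining lower bits" are those outside the mask.
            remaining = vaddr & ~entry.mask
            # Combine the gated data with the remaining bits through XOR.
            return entry.data ^ remaining
    # Miss: an external memory or I/O access would be needed to refill the TLB.
    return None


entries = [TlbEntry(mask=0xFFFFF000, tag=0x12345000, data=0xABCDE000)]
hit = tlb_lookup(entries, 0x12345678)
miss = tlb_lookup(entries, 0x99999678)
```

In hardware the comparison is performed for all entries in parallel; the loop here only models the outcome.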

Using generally known memory management techniques,
the memory management unit 122 ensures that instructions
(and data) are properly retrieved from external memory (or
other sources) over an external input/output bus 126 (see
FIG. 7). As described in more detail below, a high bandwidth
interface 124 is coupled to the external input/output bus 126
to communicate instructions (and media data streams) to the
general purpose media processor 12. The presently preferred
physical address width for the general purpose media
processor 12 is eight bytes (64 bits). In addition, the memory
management unit 122 preferably provides match bits (not
shown) that allow large memory regions to be assigned a
single TLB entry allowing for fine grain memory management
of large memory sections. The memory management
unit 122 also preferably includes a priority bit (not shown)
that allows for preferential queuing of memory areas according
to respective levels of priority. Other memory management
operations generally known in the art are also performed
by the memory management unit 122.

Referring again to FIG. 7, instructions received by the
general purpose media processor 12 are stored in a combined
instruction buffer/cache 118. The instruction buffer/




cache 118 is dynamically subdivided to store the largest
sequence of instructions capable of execution by the
execution unit 106 without the necessity of accessing external
memory. In a preferred embodiment of the invention,
instruction buffer space is allocated to the smallest and most
frequently executed blocks of media instructions. The
instruction buffer thus helps maintain the high bandwidth
capacity of the general purpose media processor 12 by
sustaining the number of instructions executed per second at
or near peak operation. That portion of the instruction
buffer/cache 118 not used as a buffer is, therefore, available
to be used as cache memory. The instruction buffer/cache
118 is coupled to the instruction path 112 and is preferably
32 kilobytes in size.

[illegible] and register file 110. The data buffer/cache 120 is also
dynamically subdivided in a manner similar to that of the
instruction buffer/cache 118. The buffer portion of the data
buffer/cache 120 is optimized to store a set size of unified
media data capable of execution without the necessity of
accessing external memory. In a preferred embodiment of
the invention, data buffer space is allocated to the smallest
and most frequently accessed working sets of media data.
Like the instruction buffer, the data buffer thus maintains
peak bandwidth of the general purpose media processor 12.
The data buffer/cache 120 is coupled to the data path 108
and is preferably also 32 kilobytes in size.

The preferred embodiment of the general purpose media
processor 12 includes a pipelined instruction pre-fetch
structure. Although pipelined operation is supported, the general
purpose media processor 12 also allows for non-pipelined
operations to execute without any operational penalty. One
preferred pipeline structure for the general purpose media
processor 12 comprises a "super-string" pipeline shown in
FIG. 11. A super-string pipeline is designed to fetch and
execute several instructions in each clock cycle. The
instructions available for the general purpose media processor 12
can be broken down into five basic steps of operation. These
steps include a register-to-register address calculation, a
memory load, a register-to-register data calculation, a
memory store and a branch operation. According to the
super-string pipeline organization of the general purpose
media processor 12, one instruction from each of these five
types may be issued in each clock cycle. The presently
preferred ordering of these operations is as listed above,
where each of the five steps is assigned a letter: "A," "L,"
"E," "S" and "B" (see FIG. 11).

According to the super-string pipelining technique, each
of the instructions is serially dependent as shown in FIG.
11, and the general purpose media processor 12 has the
ability to issue a string of dependent instructions in a single
clock cycle. These instructions shown in FIG. 11 can take
from two to five cycles of latency to execute, and a branch
prediction mechanism is preferably used to keep the
pipeline filled (described below). Instructions can be
encoded in unit categories such as address, load, store/sync,
fixed, float and branch to allow for easy decoding. A similar
scheme is employed to pre-fetch data for the general purpose
media processor 12.

As those skilled in the art will appreciate, the super-string
pipeline can be implemented in a multi-threaded environment.
In such an implementation, the number of threads is
preferably relatively prime with respect to functional unit
rates so that functional units can be scheduled in a
non-interfering fashion between each thread.

In another more preferred embodiment, a "super-spring"
pipelining scheme is employed with the general purpose
media processor 12. The super-spring pipeline technique
breaks the super-string pipeline shown in FIG. 11 into two
sections that are coupled via a memory buffer (not shown).
A visual representation of the super-spring pipeline technique
is shown in FIG. 12. The front of the pipeline 204, in
which address calculation (A), memory load (L), and branch
(B) operations are handled, is decoupled from the back of
the pipeline 206, in which data calculation (E) and memory
store (S) operations are handled. The decoupling is
accomplished through the memory buffer (not shown), which is
[illegible] organized in a first-in-first-out ("FIFO")
structure. (The memory buffer is functionally represented
as a spring in FIG. 12.)

As indicated in Table I above, the general purpose media
processor 12 [illegible]
and so relies upon branch or fetch prediction techniques to
keep the pipeline full in program flows around unconditional
and conditional branch instructions. Many such techniques
are generally known in the art. Examples of some presently
preferred techniques include the use of group compare and
set, and multiplex operations to eliminate unpredictable
branches; the use of short forward branches, which cause
pipeline neutralization; and where branch and link predicts
the return address in a one or more entry stack. In addition,
the specialized gateway instructions included in the general
purpose media processor 12 allow for branches to and from
protected virtual memory space. The gateway instructions,
therefore, allow an efficient means to transfer between
various levels of privilege.

As described above, two basic forms of media data are
processed by the general purpose media processor 12, as
shown in FIG. 7. These data streams generally comprise
Nyquist sampled I/O 128, and standard memory and I/O
130. As shown in FIG. 7, audio 132, video 134, radio 136,
network 138, tape 140 and disc 142 data streams comprise
some examples of digitally sampled I/O 128. As those
skilled in the art will appreciate, other forms of digitally
sampled I/O are contemplated for processing by the general
purpose media processor 12 without departing from the
spirit and scope of the invention. Standard memory and I/O
130 comprises data received and transmitted to and from
general digital peripheral devices used in the design of most
computer systems. As shown in FIG. 7, some examples of
such devices include dynamic random access memory
("DRAM") 146, or any data received over the PCI bus 144
generally known in the art. Other forms of standard memory
and I/O sources are also contemplated. The various
fixed-point data sizes preferred for the general purpose media
processor 12 are illustrated in FIG. 9(c).

External Interface

As mentioned above, the general purpose media processor
12 includes a high bandwidth interface 124 to communicate
with external memory and input/output sources. As part of
the high bandwidth interface 124, the general purpose media
processor 12 integrates several fast communication channels
156 (FIG. 13) to communicate externally. These fast
communication channels 156 preferably couple to external
caches 150, which serve as a buffer to memory interfaces
152 coupled to standard memory 154. The caches 150
preferably comprise synchronous static random access
memory ("SRAM"), each of which is sixty-four kilobytes
in size; and the standard memories 154 comprise DRAM's.
The memory interfaces 152 transmit data between the
caches 150 and the standard memories 154. The standard
memories 154 together form the main external memory for
the general purpose media processor 12. The cache 150,
memory interface 152, standard memory 154 and input/output
channel 156 therefore make up a single external
memory unit 158 for the general purpose media processor
12.

According to the presently preferred embodiment of the 
invention, the memory interface protocol embeds read and 
write operations to a single memory space into packets 
containing command, address, data and acknowledgment 
information. The packets preferably include check codes 
that will detect single-bit transmission errors and some 
multiple bit errors. As many as eight operations may be in 
progress at a time in each external memory unit 158. As
shown in FIG. 13, up to four external memory units 158 may
be cascaded together to expand the memory available to the
general purpose media processor 12, and to improve the
bandwidth of the external memory. Through such cascaded
memory units 158, the memory interface 152 provides for
the direct connection of multiple banks of standard memory 
154 to maintain operation of the general purpose media 
processor 12 at sustained peak bandwidths. 

According to one embodiment shown in FIG. 13, up to
four standard memory devices 154 can be coupled to each
memory interface 152. Each standard memory 154 thus
includes as many as four banks of DRAM, each of which is
preferably sixteen bits wide. The standard memories 154 are
connected in parallel to the memory interface 152 forming
a 72-bit wide data bus 160, where 64 bits are preferably
provided for data transfer and eight bits are provided for
error correction. In addition to the data bus 160, an address/
control bus 162 is coupled between the memory interface
152 and each standard memory 154. The address/control bus
162 preferably comprises at least twelve address lines (4
kilobits x 16 memory size) and four control lines as shown in
FIG. 13. An alternate manner for coupling the DRAM's to
the memory interface 152 is illustrated in FIG. 14. As shown 
in FIG. 14, two banks of four DRAM single inline memory 
modules are coupled in parallel to the memory interface 152. 
The memory interface 152 also supports interleaving to 
enhance bandwidth, and page mode accesses to improve 
latency for localized addressing. 

Using standard DRAM components, the external memory
units 158 achieve bandwidths of approximately two
gigabits/second with the standard memories 154. When four
such external memory units 158 are coupled via the
communication channel 156, therefore, the total bandwidth of
the external main memory system increases to one gigabyte/
second. As discussed further below, in implementations with
two or eight communication channels 156, the aggregate
bandwidth increases to two and eight gigabytes/second,
respectively.
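
The bandwidth figures above follow from simple arithmetic, sketched below. The constant and function names are hypothetical; the figures (two gigabits/second per memory unit, four cascaded units per channel) are taken from the text.

```python
GBIT_PER_UNIT = 2       # approximate bandwidth of one external memory unit, Gbit/s
UNITS_PER_CHANNEL = 4   # external memory units cascaded on one communication channel
BITS_PER_BYTE = 8


def aggregate_gbytes_per_second(channels: int) -> float:
    """Aggregate external memory bandwidth in gigabytes/second."""
    per_channel_gbit = GBIT_PER_UNIT * UNITS_PER_CHANNEL  # 8 Gbit/s per channel
    return channels * per_channel_gbit / BITS_PER_BYTE


one_channel = aggregate_gbytes_per_second(1)
two_channels = aggregate_gbytes_per_second(2)
eight_channels = aggregate_gbytes_per_second(8)
```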

A more detailed depiction of the communication channel
156 circuitry appears in FIG. 15. According to the preferred
embodiment of the invention, each communication channel
156 comprises two unidirectional, byte-wide, differential,
packet-oriented data channels 156a, 156b (see FIG. 13). As
explained above, where memory units 158 are cascaded
together in series, the output of one memory unit 158 is
connected to the input of another memory unit 158. The two
unidirectional channels are thus connected through the
memory units 158 forming a loop structure and make up a
single bi-directional memory interface channel.

Referring to FIG. 15, each communication channel 156 is
preferably eight bits wide, and each bit is transmitted
differentially. For example, output transceiver 170 for bit
D0out transmits both D0 and D0-bar signals over the
communication channel 156. Additional transceivers are similarly
provided for the remaining bits in the channel 156. (The
transceiver 176 for bit D7out, and associated differential lines
178, 180 are shown in FIG. 15.) A CLKout transceiver 182
is also provided to generate differential clock outputs 184,
186 over the channel 156. To complete the link between
memory units 158, input transceivers 188-192 are provided
in each memory unit 158 for each of the differential bits and
clock signals transmitted over the communication channel
156. These input signals 172, 174, 178, 180, 184, 186 are
preferably transmitted through input buffers 194-198 to
other parts of the memory unit 158 (described above).

Each memory unit 158 also includes a skew calibrator 200
and phase locked loop ("PLL") 202. The skew calibrator 200
is used to control skew in signals output to the
communication channel 156. Preferably, digital skew fields are
employed, which include set numbers of delay stages to be
inserted in the output path of the communication channel
156. Setting these fields, and the corresponding analog skew
fields, permits a fine level of control over the relative skew
between output channel signals.

The PLL 202 recovers the clock signal on either side of
the communication channel 156 and is thus provided to
remove clock jitter. The clock signals 184, 186 preferably
comprise a single phase, constant rate clock signal. The
clock signals 184, 186 thus contain alternating zero and one
values transmitted with the same timing as the data signals
172, 174, 178, 180. The clock signal frequency is, therefore,
one-half the byte data rate. The communication channel 156
preferably operates at constant frequency and contains no
auxiliary control, handshaking or flow control information.
Each external memory unit 158 preferably defines two
functional regions: a memory region, implemented by the
cache 150 backed by standard memory 154 (see FIG. 13),
and a configuration region, implemented by registers (not
shown). The two regions are accessed by separate interfaces:
the communication channel 156 is used to access the
memory region, and a serial interface (described below) is
used to access the configuration region. In the memory
region, the caches 150 are preferably write-back (write-in)
single-set (direct-map) caches for data originally contained
in standard memory 154. All accesses to memory space
should maintain consistency between the contents of the
cache 150 and the contents of the standard memory 154. The
configuration region registers provide the mechanism to
detect and adjust skew in the communication channel 156.
Software is preferably employed to adaptively adjust the
skew in the channel 156 through digital skew fields, as
explained above. The serial interface thus is used to
configure the external memory units 158, set diagnostic modes
and read diagnostic information, and to enable the use of a
high-speed tester (not shown).

One presently preferred embodiment of the invention
employs two byte-wide packet communication channels 156
(FIG. 16(a)). In order to further increase the bandwidth of
the general purpose media processor 12, up to sixteen
byte-wide packet communication channels 156 can be
employed. Referring to FIG. 16(b), twelve communication
channels, comprising eight memory channels 210, a ninth
channel for parallel processing 212 (described below), and
three input/output ("I/O") channels 214, are shown. Each of
the communication channels 210-214 preferably employs
the cascade configuration of four channel interface devices
216. (Each channel interface device 216 coupled to the
memory channels 210 corresponds to the external memory
unit 158 shown in FIG. 13.) Through each of the twelve
communication channels shown in FIG. 16(b), the general
purpose media processor 12 can request or issue read or
write transactions. When not interleaved, the twelve channels
provide a single contiguous memory space for each
channel interface device 216.

Alternatively, memory accesses may be interleaved in
order to provide for continuous access to the external
memory system at the maximum bandwidth for the DRAM
memories. In an interleaved configuration, at any point in
time some memory devices will be engaged in row precharge,
while others may be driving or receiving data, or
receiving row or column addresses. The memory interface
152 (FIG. 13) thus preferably maps between a contiguous
address space and each of the separate address spaces made
available within each external memory unit 158. For maximum
performance, therefore, the memory interface is interleaved
so that references to adjacent addresses are handled
by different memory devices. Moreover, in the preferred
embodiment, additional memory operations may be
requested before the corresponding DRAM bank is available.
In an interleaved approach, these operations are placed
in a queue until they can be processed. According to the
preferred embodiment, memory writes have lower priority
than memory reads, unless an attempt is made to read an
address that is queued for a write operation. As those skilled
in the art will appreciate, the depth of the memory write
queue is dictated by the specific implementation.

Although up to four external memory units 158 are
preferably cascaded to form effectively larger memories,
some amount of latency may be introduced by the cascade.
Packets of data transmitted over the communication channel
156 are uniquely addressed to a particular channel interface
device 216. A packet received at a particular device, which
specifies another module address, is automatically passed to
the correct channel interface device 216. Unless the module
address matches a particular device 216, that packet simply
passes from the input to the output of the interface device
216. This mechanism divides the serial interconnection of
interface devices 216 into strings, which function as a single
larger memory or peripheral, but with possibly longer
response latency.

In addition to the memory channels 210, the general
purpose media processor 12 provides several communication
channels 214 for communication with external input/
output devices. Referring to FIG. 16(b), three input/output
channels 214 having SRAM buffered memory (see FIG. 13)
provide an interface to external standard I/O devices (not
shown). Like the eight memory channels 210, the three I/O
channels 214 are byte-wide input/output channels intended
to operate at rates of at least one gigahertz. The three I/O
channels 214 also operate as a packet communication link to
synchronous SRAM memory 208 within the channel interface
device 216. A controller 226 within the channel interface
device 216 completes the interface to the I/O devices.

The three I/O channels 214 preferably function in like
manner to the memory channels 210 described above. The
interface protocol for the three I/O channels 214 divides read
and write operations to a single memory space into packets
containing command, address, data and acknowledgment
information. The packets also include a check code that will
detect single-bit transmission errors and some multiple-bit
errors. According to the preferred embodiment of the
invention, as many as eight operations may progress in each
interface device 216 at a time. As shown in FIG. 16(b), up
to four channel interface devices 216 can be cascaded
together to expand the bandwidth in the three I/O channels
214. A bit-serial interface (not shown) is also provided to
each of the channel interface devices 216 to allow access to
configuration, diagnostic and tester information at standard
TTL signal levels at a more moderate data rate. (A more
detailed description of the serial interface is provided
below).

Like the memory channels 210, each I/O channel 214
includes nine signals: one clock signal and eight data
signals. Differential voltage levels are preferably employed
for each signal. Each channel interface device 216 is preferably
terminated in a nominal 50 ohm impedance to
ground. This impedance applies for both inputs and outputs
to the communication channel 156. A programmable
termination [illegible] is preferred.

Interface Communication

According to one presently preferred embodiment of the
invention, the channel interface devices 216 can operate as
either master devices or slave devices. A master device is
capable of generating a request on the communication
channel 156 and receiving responses from the communication
channel 156. Slave devices are capable of receiving
requests and generating responses over the communication
channel 156. A master device is preferably capable of
generating a constant frequency clock signal and accepting
signals at the same clock frequency over the communication
channel 156. A slave device, therefore, should operate at the
same clock rate as the communication channel 156, and
[illegible] no more than a specified amount of variation in
output clock phase relative to input clock phase. The master
device, however, can accept an arbitrary input clock phase
and tolerates a specified amount of variation in clock phase
over operating conditions.

Packets of information sent over the communication
channel 156 preferably contain control commands, such as
read or write operations, along with addresses and associated
data. Other commands are provided to indicate error conditions
and responses to the above commands. When the
communication channel 156 is idle, such as during initialization
and between transmitted packets, an idle packet,
consisting of an all-zero byte and an all-one byte, is
transmitted through the communication channel 156. Each
non-idle packet consists of two bytes or a multiple of two
bytes, and begins with a byte having a value other than all
zero. All packets transmitted over the communication channel
156 also begin during a clock period in which the clock
signal is zero, and packets preferably end during a clock
period in which the clock signal is one. A depiction of the
preferred packet protocol format for transmission over the
communication channel 156 appears in FIG. 17.

The general form of each packet is an array of bytes
preferably without a specific byte ordering. The first byte
contains a module address 250 ("ma") in the high order two
bits; a packet identifier, usually a command 252 ("com"), in
the next three bit positions; and a link identification number
254 ("lid") in the last three bit positions. The interpretation
of the remaining bytes of a packet depends upon the contents
of the packet identifier. The length of each packet is preferably
implied by the command specified in the initial byte
of the packet. A check byte is provided and computed as odd
bit-wise parity with a leftward circular rotation after
accumulating each byte. This technique provides detection of all
single-bit and some multiple-bit errors, but no correction is
provided.

The module address 250 field of each packet is preferably
a two-bit field and allows for as many as four slave
devices to be operated from a single communication channel
156. Module address values can be assigned in one of two
fashions: either dynamically assigned through a configuration
register (not shown), or assigned via static/geometric
configuration pins. Dynamic assignment through a configuration
register is the presently preferred method for assigning
module address values.
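
The layout of the first packet byte described above (a two-bit module address in the high order bits, then a three-bit packet identifier, then a three-bit link identification number) can be sketched as follows. The function names pack_header and unpack_header are hypothetical, and the sketch assumes the three fields occupy bits 7-6, 5-3 and 2-0 respectively; actual command encodings are those of FIG. 17 and are not reproduced here.

```python
from typing import Tuple


def pack_header(ma: int, com: int, lid: int) -> int:
    """Build the first packet byte from its three fields."""
    if not (0 <= ma < 4 and 0 <= com < 8 and 0 <= lid < 8):
        raise ValueError("field out of range")
    # ma: bits 7-6, com: bits 5-3, lid: bits 2-0 (assumed bit positions)
    return (ma << 6) | (com << 3) | lid


def unpack_header(first_byte: int) -> Tuple[int, int, int]:
    """Split the first packet byte into (ma, com, lid)."""
    return (first_byte >> 6) & 0x3, (first_byte >> 3) & 0x7, first_byte & 0x7


header = pack_header(ma=2, com=5, lid=3)
fields = unpack_header(header)
```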

The link identification number 254 field is preferably
3 bits wide and provides the opportunity for master devices
to initiate as many as eight independent operations at any
one time to each slave device. Each outstanding operation 
requires a distinct link identification number, but no ordering 
of operations should be implied by the value of the link 
identification field. Thus, there is preferably no requirement 
for link identification values 254 to be sequentially assigned 
either in requests or responses. 
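
One way a master device might track its outstanding operations is to keep the eight link identification values in a free pool, as sketched below. This is an illustrative assumption rather than the mechanism of the preferred embodiment; the class name LinkIdPool and its methods are hypothetical. A set is used so that no ordering of values is implied.

```python
from typing import Optional


class LinkIdPool:
    """Free pool of link identification values for one slave device."""

    def __init__(self, size: int = 8) -> None:
        self._size = size
        self._free = set(range(size))  # every value starts out free

    def acquire(self) -> Optional[int]:
        """Take any free value, or return None if all are outstanding."""
        if not self._free:
            return None
        return self._free.pop()  # arbitrary choice: no ordering is implied

    def release(self, lid: int) -> None:
        """Return a value to the pool once its operation has completed."""
        if not 0 <= lid < self._size or lid in self._free:
            raise ValueError("link id is not outstanding")
        self._free.add(lid)


pool = LinkIdPool()
issued = [pool.acquire() for _ in range(8)]  # eight operations in flight
ninth = pool.acquire()                       # no value left for a ninth
pool.release(issued[3])
reused = pool.acquire()
```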

The receipt of packets over the communication channel 
156 that do not conform to the channel protocol preferably 
generates an error condition. As those skilled in the art will 
appreciate, the level or degrees to which a specific imple- 
mentation detects errors is defined by the user. In one 
presently preferred embodiment of the invention, all errors 
arc detected, and the following protocol is employed for 
handling errors. For each error detected, the channel inter- 
face device 216 causes a response explicitly indicating the 
error condition. Channel interface devices 216 reporting an 
invalid packet will then suppress the receipt of additional 
packets until the error is cleared. The transmitted packet is 
otherwise ignored. However, even though the erroneous 
packet is ignored, the channel interface devices 216 preferably
continue to process valid packets that have already been
received and generate responses thereto. An identification of
the presently preferred commands 252 to be used over the
communication channel 156 is listed in FIG. 17.
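
The check byte used to detect invalid packets is described earlier as odd bit-wise parity with a leftward circular rotation after accumulating each byte. A minimal sketch of one possible reading follows. The starting value of zero, the one-bit rotation amount, and the final inversion used to obtain odd parity are assumptions, as are the function names; the text does not specify these details.

```python
def rotl8(value: int) -> int:
    """Rotate an 8-bit value left by one bit position."""
    return ((value << 1) | (value >> 7)) & 0xFF


def check_byte(packet_bytes: bytes) -> int:
    """Accumulate bit-wise parity over the bytes, rotating after each one."""
    acc = 0  # assumed starting value
    for b in packet_bytes:
        acc ^= b          # bit-wise parity accumulation
        acc = rotl8(acc)  # leftward circular rotation after each byte
    return acc ^ 0xFF     # inversion assumed to give odd parity


def packet_is_valid(packet_with_check: bytes) -> bool:
    """Compare the trailing check byte with one recomputed from the rest."""
    return check_byte(packet_with_check[:-1]) == packet_with_check[-1]


payload = bytes([0x12, 0x34, 0x56])
framed = payload + bytes([check_byte(payload)])
```

Because exclusive-OR and rotation are both reversible, flipping any single bit of the framed packet changes the comparison result, which is consistent with the stated detection of all single-bit errors.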

In the master/slave preferred embodiment, the channel
interface devices 216 forward packets that are intended for
other devices connected to the communication channel 156,
as described above. In slave devices, forwarding is performed
based on the module address 250 field of the packet.
Packets which contain a module address 250 other than that
of the current device are forwarded on to the next device. All
non-idle packets are thus forwarded, including error packets.
In master devices, forwarding is performed based on the link
identifier number 254 of the packet. Packets that contain link 
identifier numbers 254 not generated by the specific channel 
interface device 216 are forwarded. In order to reduce 
transmission latency, a packet buffer may be provided. As 
those skilled in the art appreciate, the suitable size for the 
packet buffer depends on the amount of latency tolerable in 
a particular implementation. 
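
The two forwarding rules above can be sketched as a pair of predicates. This is an illustrative sketch under stated assumptions: the Packet type and function names are hypothetical, only the two fields used for the decision are modeled, and the treatment of idle packets (never forwarded) is inferred from the phrase "all non-idle packets".

```python
from dataclasses import dataclass


@dataclass
class Packet:
    module_address: int  # module address 250 field
    link_id: int         # link identifier number 254 field
    idle: bool = False   # assumption: idle packets are flagged somehow


def slave_should_forward(packet, own_module_address):
    """A slave forwards any non-idle packet addressed to another module."""
    if packet.idle:
        return False
    return packet.module_address != own_module_address


def master_should_forward(packet, own_link_ids):
    """A master forwards any non-idle packet whose link id it did not issue."""
    if packet.idle:
        return False
    return packet.link_id not in own_link_ids
```

For example, a slave configured with module address 2 consumes a packet addressed to module 2 and passes a packet addressed to module 3 to the next device in the ring.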

A variety of master/slave ring configurations are possible
using the high bandwidth interface 124 of the invention.
Five ring configurations are currently preferred:
single-master, dual-master, multiple-master, single-slave and
multiple-master/multiple-slave. The simplest ring
configuration contains a single non-forwarding master device and a
single non-forwarding slave device. No forwarding is
required for either device in this configuration as packets are
sent directly to the recipient. A single-master ring, however,
may contain a cascade of up to four slave devices (see FIGS.
13, 16). In the single-master ring configuration, each slave
device is configured to a distinct module address, and each
slave device forwards packets that contain module address
fields unequal to its own. As discussed above, a
single-master ring provides a larger memory or I/O capacity than
a master-slave pair, but also introduces a potentially longer
response latency. In the single-master ring, each slave device
may have as many as eight transactions outstanding at any
time, as described above.

The remaining combinations share many of the above
basic attributes. In a dual-master pair, each master device
may initiate read and write operations addressed to the other,
and each may have up to eight such transactions outstanding.
No forwarding is required for either device because packets
are sent directly to the recipient. A multiple-master ring may
contain multiple master devices and a single slave device. In
this configuration, the slave device need not forward packets
as all input packets are designated for the single slave
device. A multiple-master ring may contain multiple master
devices and as many as four slave devices. Each slave device
may have up to eight transactions outstanding, and each
master device may use some of those transactions. In a
preferred embodiment, a master also has the capability to
detect a time-out condition or when a response to a request
packet is not received. Further aspects of interprocessor
communications and configurations are discussed below in
connection with FIG. 18.

Serial Bus 

In one preferred embodiment of the invention, the general
purpose media processor 12 includes a serial bus (not
shown). The serial bus is designed to provide bootstrap
resources, configuration, and diagnostic support to the
general purpose media processor 12. The serial bus preferably
employs two signals, both at TTL levels, for direct
communication among many devices. In the preferred embodiment,
the first signal is a continuously running clock, and the
second signal is an open-collector bi-directional data signal.
Four additional signals provide geographic addresses for
each device coupled to the serial bus. A gateway protocol,
and optional configurable addressing, each provide a means
to extend the serial bus to other buses and devices. Although
the serial bus is designed for implementation in a system
having a general purpose media processor 12, as those
skilled in the art will appreciate, the serial bus is applicable
to other systems as well.

Because the serial bus is preferably used for the initial
bootstrap program load of the general purpose media
processor 12, the bootstrap ROM is coupled to the serial bus. As
a result, the serial bus needs to be operational for the first
instruction fetch. The serial bus protocol is therefore devised
so that no transactions are required for initial bus
configuration or bus address assignment.

According to the preferred embodiment, the clock signal
comprises a continuously running clock signal at a minimum
of 20 megahertz. The amount of skew, if any, in the clock
signal between any two serial bus devices should be limited
to be less than the skew on the data signal. Preferably, the
serial data signal is a non-inverted open collector
bi-directional data signal. TTL levels are preferred for
communication on the serial bus, and several termination
networks may be employed for the serial data signal. A
simple preferred termination network employs a resistive
pull-up of 220 ohms to 3.3 volts above VSS. An alternate
embodiment employs a more complex termination network
such as a termination network including diodes or the
"Forced Perfect Termination" network proposed for the
SCSI-2 standard, which may be advantageous for larger
configurations.

The geographic addressing employed in the serial bus is
provided to ensure that each device is addressable with a
number that is unique among all devices on the bus and
which also preferably reflects the physical location of the
device. Thus, the address of each device remains the same
each time the system is operated. In one preferred
embodiment, the geographic address is composed of four
bits, thus allowing for up to 16 devices. In order to extend
the geographic addressing to more than 16 devices,
additional signals may be employed such as a buffered copy of
the clock signal or an inverted copy of the clock signal (or
both).

The serial bus preferably incorporates both a bit level and
packet protocol. The bit level protocol allows any device to
transmit one bit of information on the bus, which is received
by all devices on the bus at the same time. Each transmitted
bit begins at the rising edge of the clock signal and ends at
the next rising edge. The transmitted bit value is sampled at
the next rising edge of the clock signal. According to one
preferred embodiment where the serial data signal is an open
collector signal, the transmission of a zero bit value on the
bus is achieved by driving the serial data signal to a logical
low value. In this embodiment, the transmission of a one bit
value is achieved by releasing the serial data signal to obtain
a logical high value. If more than one device attempts to
transmit a value on the same clock, the resulting value is a
zero if any device transmits a zero value, and one if all
devices transmit a one value. This provides a "wired-AND"
collision mechanism as those skilled in the art will
appreciate. If two or more devices transmit the same value on the
same clock cycle, however, no device can detect the
occurrence of a collision. In such cases, the transaction, which
may occur frequently in some implementations, preferably
proceeds as described below.
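
The wired-AND behavior of the open collector data signal can be modeled in a few lines. This is a sketch of the logical behavior only (it ignores timing and electrical levels), and the function name bus_value is hypothetical.

```python
def bus_value(driven_bits):
    """Resolve one clock cycle on the open-collector serial data line.

    Each entry is the bit one device transmits: 0 means the device drives
    the line low, 1 means it releases the line. The pull-up makes the
    line read 1 only when every device releases it.
    """
    value = 1  # an undriven line is pulled to a logical high
    for bit in driven_bits:
        value &= bit  # any device driving 0 forces the line to 0
    return value
```

A device that released the line (transmitted a one) but samples a zero therefore knows another device is transmitting; devices that transmitted identical values see nothing unusual.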

The packet protocol employed with the serial bus uses the 
bit level protocol to transmit information in units of eight 
bits or multiples of eight bits. Each packet transmission 
preferably begins with a start bit in which the serial data 
signal has a zero (driven) value. After transmitting the eight 
data bits, a parity bit is transmitted. The transmission continues
with additional data. A single one (released) bit is
transmitted immediately following the least significant bit of 
each byte signaling the end of the byte. 

On the cycle following the transmission of the parity bit,
any device may demand a delay of two cycles to process the
data received. The two cycle delay is initiated by driving the
serial data signal (to a zero value) and releasing the serial
data signal on the next cycle. Before releasing the serial data
signal, however, it is preferable to ensure that the signal is
not being driven by any other device. Further delays are
available by repeating this pattern.

In order to avoid collisions, a device is not permitted to 
start a transmission over the serial bus unless there are no 
currently executing transactions. To resolve collisions that 
may occur if two devices begin transmission on the same 
cycle, each transmitting device should preferably monitor 
the bus during the transmission of one (released) bits. If any 
of the bits of the byte are received as zero when transmitting
a one, the device has lost arbitration and must cease trans- 
mission of any additional bits of the current byte or trans- 
action. 
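
The arbitration rule can be illustrated with a small simulation of one byte time. This is a sketch under stated assumptions, not the bus logic itself: the function name arbitrate is hypothetical, the start and parity bits are omitted, and bits are assumed to be sent most significant first (inferred from the statement that the end-of-byte bit follows the least significant bit).

```python
def arbitrate(bytes_by_device):
    """Simulate wired-AND arbitration over one byte.

    bytes_by_device maps a device name to the byte it tries to send
    (at least one device is required). Returns the set of devices still
    transmitting at the end of the byte and the bit values seen on the bus.
    """
    active = set(bytes_by_device)
    bus_bits = []
    for bit_index in range(7, -1, -1):  # assumption: MSB first
        sent = {name: (bytes_by_device[name] >> bit_index) & 1
                for name in active}
        bus = min(sent.values())  # wired-AND: any driven zero wins
        bus_bits.append(bus)
        # A device that released the line (sent 1) but reads 0 has lost
        # arbitration and stops transmitting.
        active = {name for name in active if sent[name] == bus}
    return active, bus_bits
```

With devices A and B sending 0xB0 and 0xA0, the first three bits agree; on the fourth bit A releases the line while B drives it low, so A withdraws and B's byte appears on the bus undisturbed. Two devices sending the same byte both remain active, which matches the earlier remark that such a collision cannot be detected.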

According to the preferred embodiment of the invention, 
a serial bus transaction consists of the transmission of a 
series of packets. The transaction begins with a transmission 
by the transaction initiator, which specifies the target 
network, device, length, type and payload of the transaction 
request. The transaction terminates with a packet having a 
type field in a specified range. As a result, all devices 
connected to the serial bus should monitor the serial data 
signal to determine when transactions begin and end. A 
serial bus network may have multiple simultaneous trans- 
actions occurring, however, so long as the target and initiator 
network addresses are all disjoint. 

Parallel Processing

In one preferred embodiment of the invention, two or
more general purpose media processors 12 can be linked
together to achieve a multiple processor system. According
to this embodiment, general purpose media processors 12
are linked together using their high bandwidth interface
channels 124, either directly or through external switching
components (not shown). The dual-master pair configuration
described above can thus be extended for use in
multiple-master ring configurations. Preferably, internal daemons
provide for the generation of memory references to remote
processors, accesses to local physical memory space, and the
transport of remote references to other remote processors. In
a multi-processor environment, all general purpose media
processors 12 run off of a common clock frequency, as
required by the communication channels 156 that connect
between processors.

Referring to FIG. 18, each general purpose media
processor 12 preferably includes at least a pair of
inter-processor links 218 (see also FIG. 16(b)). In one
configuration, both pairs of inter-processor links 218 can be
connected between the two processors 12 to further enhance
bandwidth. As shown in FIG. 18(a), several processors 12
may be interconnected in a linear network employing the
transponder daemons in each processor. In an alternate
embodiment shown in FIG. 18(b), the inter-processor links
218 may be used to join the general purpose media
processors 12 in a ring configuration. Alternatively still, general
purpose media processors 12 may be interconnected into a
two-dimensional network of processors of arbitrary size, as
shown in FIG. 18(c). Sixteen processors are connected in
FIG. 18(c) by connecting four ring networks. In yet another
alternate embodiment, by connecting the inter-processor
links 218 to external switching devices (not shown),
multi-processors with a large number of processors can be
constructed with an arbitrary interconnection topology.

The requester, responder and transponder daemons
preferably handle all inter-processor operations. When one
general purpose media processor 12 attempts a load or store to
a physical address of a remote processor, the requester
daemon autonomously attempts to satisfy the remote
memory reference by communicating with the external
device. The external device may comprise another processor
12 or a switching device (not shown) that eventually reaches
another processor 12. Preferably, two requester daemons are
provided in each processor 12, which act concurrently on two
different byte channels and/or module addresses. The
responder daemon accepts writes from a specified channel
and module address, which enables an external device to
generate transaction requests in local memory or to generate
processor events. The responder daemon also generates link
level writes to the same external device that communicated
responses for the received transaction request. Two such
responder daemons are preferably provided, each of which
operates concurrently on two different byte channels and/or
module addresses.

The transponder daemon accepts writes from a specified
channel and module address, which enable an external
device to cause a requester daemon to generate a request on
another channel and module address. Preferably, two such
transponder daemons are provided, each of which acts
concurrently (back-to-back) between two different byte channels
and/or module addresses. As those skilled in the art will
appreciate, the requester, responder and transponder
daemons must act cooperatively to avoid deadlock that may
arise due to an imbalance of requests in the system.
Deadlocks prevent responses from being routed to their
destinations, which may defeat the benefits of a
multi-processor distributed system.

According to one presently preferred embodiment of the
invention, the general purpose media processor 12 can be
implemented as one or more integrated circuit chips.
Referring to FIG. 19, the presently preferred embodiment of the
general purpose media processor 12 consists of a four-chip
set. In the four-chip set, a general purpose media processor
12 is manufactured as a stand alone integrated circuit. The
stand alone integrated circuit includes a memory
management unit 122, instruction and data cache/buffers 118, 120,
and an execution unit 140. A plurality of signal input/output
pads 260 are provided around the circumference of the
integrated circuit to communicate signals to and from the
general purpose media processor 12 in a manner generally
known in the art.

The second and third chips of the four-chip set comprise
an external memory element 158 and a channel interface
device 216. The external memory element 158 includes an
interface to the communication channel 156, a cache 150
and a memory interface 152. The channel interface device
216 also includes an interface to the communication channel
156, as well as buffer memory 262, and input/output
interfaces 264. Both the external memory element 158 and the
channel interface device 216 include a plurality of
input/output signal pads 260 to communicate signals to and from
these devices in a generally known manner.

The fourth integrated circuit chip comprises a switch 226,
which allows for installation of the general purpose media
processor 12 in the heterogeneous network 38. In addition to
the plurality of input/output pads 260, the switch 226
includes an interface to the communication channel 156. The
switch 226 also preferably includes a buffer 262, a router
266, and a switch interface 268.

As those skilled in the art will appreciate, many
implementations for the general purpose media processor 12 are
possible in addition to the four-chip implementation
described above. Rather than an integrated approach, the
general purpose media processor can be implemented in a
discrete manner. Alternatively, the general purpose media
processor 12 can be implemented in a single integrated
circuit, or in an implementation with fewer than four
integrated circuit chips. Other combinations and permutations of
these implementations are contemplated.

There has been described a system for processing streams
of media data at substantially peak rates to allow for real
time communication over a large heterogeneous network.
The system includes a media processor at its core that is
capable of processing such media data streams. The
heterogeneous network consists of, for example, the fiber optic/
coaxial cable/twisted wire network in place throughout the
U.S. To provide for such communication of media data, a
media processor according to the invention is disposed at
various locations throughout the heterogeneous network.
The media processor would thus function both in a server
capacity and at an end user site within the network.
Examples of such end user sites include televisions, set-top
converter boxes, facsimile machines, wireless and cellular
telephones, as well as large and small business and industrial
applications.

To achieve such high rates of data throughput, the media
processor includes an execution unit, high bandwidth
interface, memory management unit, and pipelined
instruction and data paths. The high bandwidth interface includes
a mechanism for transmitting media data streams to and
from the media processor at rates at or above the gigahertz
frequency range. The media data stream can consist of
transmission, presentation and storage type data transmitted
alone or in a unified manner. Examples of such data types
include audio, video, radio, network and digital
communications. According to the invention, the media processor is
dynamically partitionable to process any combination or
permutation of these data types in any size.
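
As a rough illustration of what partitioning a wide data path by symbol size means, the sketch below adds two words lane by lane, with the lane width chosen at call time. It is an assumption-laden example rather than the execution unit of the invention: the function name partitioned_add is hypothetical, the 128-bit default mirrors the data path width recited in the claims, and wraparound (modular) arithmetic within each lane is assumed.

```python
def partitioned_add(a, b, symbol_bits, path_bits=128):
    """Add two path_bits-wide words as independent symbol_bits-wide lanes.

    Carries do not propagate across lane boundaries; each lane wraps
    modulo 2**symbol_bits (an assumption made for this sketch).
    """
    if path_bits % symbol_bits != 0:
        raise ValueError("symbol width must divide the data path width")
    mask = (1 << symbol_bits) - 1
    result = 0
    for shift in range(0, path_bits, symbol_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift  # drop the carry out of the lane
    return result
```

The same operands give different results under different partitions: adding 0x00FF and 0x0001 as 8-bit symbols wraps the low lane to zero, while the same addition with 16-bit symbols carries into the upper byte and yields 0x0100.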

A programmable, general purpose media processor sys- 
tem presents significant advantages over current multimedia 
communications. Rather than rigid, costly and inefficient 
specialized processors, the media processor provides a gen- 
eral purpose instruction set to ease programmability in a 
single device that is capable of performing all of the opera- 
tions of the specialized processor combination. Providing a 
uniform instruction set for all media related operations 
eliminates the need for a programmer to learn several
different instruction sets, each for a different specialized 
processor. The complexity of programming the specialized 
processors to work together and communicate with one 
another is also greatly reduced. The unified instruction set is 
also more efficient. Highly specialized general calculation 
instructions that are tailored to general or special types of 
calculations rather than enhancing communication are
eliminated.

Moreover, the media processor system can be easily 
reprogrammed simply by transmitting or downloading new 
software over the network. In the specialized processor 
approach, new programming usually requires the delivery 
and installation of new hardware. Reprogramming the media 
processor can be done electronically, which of course is 
quicker and less costly than the replacement of hardware. 

It is to be understood that a wide range of changes and 
modifications to the embodiments described above will be 
apparent to those skilled in the art and are contemplated. It 
is therefore intended that the foregoing detailed description 
be regarded as illustrative rather than limiting, and that it be 
understood that it is the following claims, including all 
equivalents, that are intended to define the spirit and scope 
of this invention. 

We claim: 

1. A general purpose programmable media processor
having an instruction path and a data path to digitally
process a plurality of media data streams, comprising:

a high bandwidth external interface operable to receive a
plurality of data of various sizes from an external
source and communicate the received data over the data
path at a rate that maintains substantially peak
operation of the media processor;

at least one register file configurable to receive and store
data from the data path and to communicate the stored
data to the data path; and

a multi-precision execution unit coupled to the data path,
the multi-precision execution unit configurable to
dynamically partition data received from the data path
to account for the elemental symbol width of the
plurality of media data streams, said elemental symbol
width being equal to or narrower than the data path, and
programmable to operate on the data to generate a
unified symbol output to the data path.

2. The media processor defined in claim 1, wherein the
execution unit is dynamically configurable to partition data
received from the data path.

3. The media processor defined in claim 1, further
comprising:

means for moving data between registers and memory by
performing load and store operations, and for
coordinating the sharing of data among a plurality of tasks by
performing synchronization operations based upon
instructions and data received by the execution unit;

means for securely controlling the sequence of execution
by performing branch and gateway operations based
upon instructions and data received by the execution
unit; and




a memory management unit, the memory management 
unit operable to retrieve data and instructions for timely 
and secure communication over the data path and 
instruction path. 

4. The media processor defined in claim 3, further
comprising:

a combined instruction cache and buffer, the combined
instruction cache and buffer dynamically allocated
between cache space and buffer space to ensure
real-time execution of multiple media instruction streams;
and

a combined data cache and buffer, the combined data
cache and buffer dynamically allocated between cache
space and buffer space to ensure real-time response for
multiple media data streams.

5. The media processor defined in claim 4, wherein
real-time execution is ensured by dynamically allocating
instruction buffer space to the smallest and most frequently
executed blocks of media instructions.

6. The media processor defined in claim 4, wherein
real-time response is ensured by dynamically allocating data
buffer space to the smallest and most frequently accessed
working sets of media data.

7. The media processor defined in claim 1, wherein media
data streams comprise Nyquist sampled inputs and outputs.

8. The media processor defined in claim 1, wherein media
data streams originate from standard computer memory and
I/O interfaces.

9. The media processor defined in claim 1, wherein the
multi-precision execution unit is configurable to divide the
data into component symbols of various sizes, analyze the
component symbols based upon instructions, and
resynthesize the component symbols for communication over the
data path.

10. The media processor defined in claim 1, wherein the
plurality of media data streams comprise presentation media
information, transmission media information, and storage
media information.

11. The media processor defined in claim 10, wherein
presentation media information comprises audio, video,
image, and graphical information.

12. The media processor defined in claim 10, wherein
transmission media information comprises radio and
network data transmissions.

13. The media processor defined in claim 10, wherein
storage media information comprises data encoded in
moving and solid-state memory media.

14. The media processor defined in claim 1, wherein the
width of the data path is at least 128 bits.

15. The media processor defined in claim 1, wherein the
multi-precision execution unit comprises a dynamically
partitionable arithmetic unit, a register controllable cross-bar
switch, and an extended mathematical element.

16. The media processor defined in claim 13, wherein the
register controllable cross-bar switch comprises a Benes
network design.

17. The media processor defined in claim 15, wherein the
register controllable cross-bar switch is programmable and
is operable to manipulate symbols.

18. The media processor defined in claim 11, wherein the
extended mathematical element is operable to perform finite
group, finite field, finite ring and table look-up operations on
the symbols.

19. The media processor defined in claim 1, further
comprising a set of predefined instructions accessible by a
user.

20. The media processor defined in claim 2, wherein the
means for performing load, store, and synchronization
operations and the means for performing branch and
gateway operations comprises a set of predefined instructions
accessible by a user.

21. The media processor defined in claim 20, wherein the
predefined instructions are combinable to implement
composite functions on the plurality of media data streams.

22. A parallel multi-processor system that maintains
substantially peak data throughput in the unified execution of a
plurality of media data streams, the system having a data path,
comprising:

at least one high bandwidth external interface, the at least
one high bandwidth external interface coupled to the
data path and operable to receive a plurality of data of
various sizes from an external source and communicate
the received data over the data path at a rate that
maintains substantially peak operation of the parallel
multi-processor system;

a plurality of register files, each register file having at least
one general purpose register coupled to the data path
and operable to store a working set of media data
received from the data path and to communicate the
stored data to the data path; and

at least one multi-precision execution unit coupled to the
data path, the at least one multi-precision execution
unit configurable to dynamically partition data within
the working set of media data received from the data
path to account for the elemental symbol width of the
plurality of media data streams, said elemental symbol
width being equal to or narrower than the data path, and
programmable to operate in parallel on the dynamically
partitioned data to generate a unified symbol output for
each register file.

23. The parallel multi-processor system defined in claim
22, wherein the at least one execution unit alternates in a
round robin manner to operate on data stored in the plurality
of register files.

24. The parallel multi-processor system defined in claim
22, further comprising an instruction pre-fetch pipeline.

25. The parallel multi-processor system defined in claim
24, wherein the instruction pre-fetch pipeline comprises a
super-string pipeline.

26. The parallel multi-processor system defined in claim
24, wherein the instruction pre-fetch pipeline comprises a
super-spring pipeline.

27. The parallel multi-processor system defined in claim
22, further comprising a data pre-fetch pipeline.

28. The parallel multi-processor system defined in claim
27, wherein the data pre-fetch pipeline comprises a
super-string pipeline.

29. The parallel multi-processor system defined in claim
27, wherein the data pre-fetch pipeline comprises a
super-spring pipeline.

30. The parallel multi-processor system defined in claim
22, further comprising a requester, responder and
transponder daemon.

***** 
ABSTRACT

A general purpose, programmable media processor for
processing and transmitting a media data stream of audio,
video, radio, graphics, encryption, authentication, and
networking information in real-time. The media processor
incorporates an execution unit that maintains substantially
peak data throughput of media data streams. The execution
unit includes a dynamically partitionable multi-precision
arithmetic unit, programmable switch and programmable
extended mathematical element. A high bandwidth external
interface supplies media data streams at substantially peak
rates to a general purpose register file and the
multi-precision execution unit. A memory management unit, and
instruction and data cache/buffers are also provided. High
bandwidth memory controllers are linked in series to
provide a memory channel to the general purpose,
programmable media processor. The general purpose, programmable
media processor is disposed in a network fabric consisting of
fiber optic cable, coaxial cable and twisted pair wires to
transmit, process and receive single or unified media data
streams. Parallel general purpose media processors are
disposed throughout the network in a distributed virtual manner
to allow for multi-processor operations and sharing of
resources through the network. A method for receiving,
processing and transmitting media data streams over the
communications fabric is also provided.

8 Claims, 25 Drawing Sheets 
GENERAL PURPOSE, MULTIPLE
PRECISION PARALLEL OPERATION,
PROGRAMMABLE MEDIA PROCESSOR

This is a divisional of application Ser. No. 08/516,036,
filed Aug. 16, 1995, now U.S. Pat. No. 5,742,840.

A Microfiche Appendix consisting of 4 sheets (387 total
frames) of microfiche is included in this application. The
Microfiche Appendix contains material which is subject to
copyright protection. The copyright owner has no objection
to the facsimile reproduction by any one of the Microfiche
Appendix, as it appears in the Patent and Trademark Office
patent files or records, but otherwise reserves all copyright
rights whatsoever.

FIELD OF THE INVENTION 

This invention relates to the field of communications 
processing, and more particularly, to a method and apparatus
for real-time processing of multi-media digital communica- 
tions. 

BACKGROUND OF THE INVENTION 

Optical fiber and discs have made the transmission and
storage of digital information both cheaper and easier than
older analog technologies. An improved system for digital
processing of media data streams is necessary in order to
realize the full potential of these advanced media.

For the past century, telephone service delivered over 
copper twisted pair has been the lingua franca of commu- 
nications. Over the next century, broadband services deliv- 
ered over optical fiber and coax will more completely fulfill 
the human need for sensory information by supplying voice, 
video, and data at rates of about 1,000 times greater than 
narrow band telephony. Current general-purpose
microprocessors and digital signal processors ("DSPs") can handle
digital voice, data, and images at narrow band rates, but they
are way too slow for processing media data at broadband 
rates. 

This shortfall in digital processing of broadband media is
currently being addressed through the design of many
different kinds of application-specific integrated circuits
("ASICs"). For example, a prototypical broadband device
such as a cable modem modulates and demodulates digital
data at rates up to 45 Mbits/sec within a single 6 MHz cable
channel (as compared to rates of 28.8 Kbits/sec within a 6
KHz channel for telephone modems) and transcodes it onto
a 10/100 baseT connection to a personal computer ("PC") or
workstation. Current cable modems thus receive data from
a coaxial cable connection through a chain of specialized
ASIC devices in order to accomplish Quadrature Amplitude
Modulation ("QAM") demodulation, Reed-Solomon error
correction, packet filtering, Data Encryption Standard
("DES") decryption, and Ethernet protocol handling. The
cable modems also transmit data to the coaxial cable link
through a second chain of devices to achieve DES
encryption, Reed-Solomon block encoding, and Quaternary
Phase Shift Keying ("QPSK") modulation. In these
environments, a general-purpose processor is usually
required as well in order to perform initialization, statistics
collection, diagnostics, and network management functions.

The ASIC approach to media processing has three fundamental 
flaws: cost, complexity, and rigidity. The combined 
silicon area of all the specialized ASIC devices 
required in the cable modem, for example, results in a 
component cost incompatible with the per subscriber price 
target for a cable service. The cable plant itself is a very 
hostile service environment, with noise ingress, reflections, 
nonlinear amplifiers, and other channel impairments, especially 
when viewed in the upstream direction. Telephony 
modems have developed an elaborate hierarchy of algorithms 
implemented in DSP software, with automatic reduction 
of data rates from 28.8 Kbits/sec to 19.0 Kbits/sec, 14.4 
Kbits/sec, or much lower rates as needed to accommodate 
noise, echoes, and other impairments in the copper plant. 
Implementing similar algorithms on an ASIC-based broadband 
modem is far more complex than achieving them in software. 

These problems of cost, complexity, and rigidity are 
compounded further in more complete broadband devices 
such as digital set-top boxes, multimedia PCs, or video 
conferencing equipment, all of which go beyond the basic 
radio frequency ("RF") modem functions to include a broad 
range of audio and video compression and decoding 
algorithms, along with remote control and graphical user 
interfaces. Software for these devices must control what 
amounts to a heterogeneous multi-processor, where each 
specialized processor has a different, and usually eccentric 
or primitive, programming environment. Even if these programming 
environments are mastered, the degree of programmability 
is limited. For example, Motion Picture Expert 
Group-I ("MPEG-I") chips manufactured by AT&T Corporation 
will not implement advances such as fractal- and 
wavelet-based compression algorithms, and these chips are 
not readily software upgradeable to the MPEG-II standard. 
A broadband network operator who leases an MPEG ASIC-based 
product is therefore at risk of having to continuously 
upgrade his system by purchasing significant amounts of 
new hardware just to track the evolution of MPEG standards. 

The high cost of ASIC-based media processing results 
from inefficiencies in both memory and logic. A typical 
ASIC consists of a multiplicity of specialized logic blocks, 
each with a small memory dedicated to holding the data 
which comprises the working set for that block. The silicon 
area of these multiple small memories is further increased by 
the overhead of multiple decoders, sense amplifiers, write 
drivers, etc. required for each logic block. The logic blocks 
are also constrained to operate at frequencies determined by 
the internal symbol rates of broadband algorithms in order to 
avoid additional buffer memories. These frequencies typically 
differ from the optimum speed-area operating point of 
a given semiconductor technology. Interconnect and synchronization 
of the many logic and memory blocks are also 
major sources of overhead in the ASIC approach. 

The disadvantages of the prior ASIC approach can be overcome 
by a single unified media processor. The cost advantages 
of such a unified processor can be achieved by 
gathering all the many ASIC functions of a broadband media 
product into a single integrated circuit. Cost reduction is 
further increased by reducing the total memory area of such 
a circuit by replacing the multiplicity of small ASIC memories 
with a single memory hierarchy large enough to accommodate 
the sum total of all the working sets, and wide 
enough to supply the aggregate bandwidth needs of all the 
logic blocks. Additionally, the logic block interconnect 
circuitry to this memory hierarchy may be streamlined by 
providing a generally programmable switching fabric. Many 
of the logic blocks themselves can also be replaced with a 
single multi-precision arithmetic unit, which can be internally 
partitioned under software control to perform addition, 
multiplication, division, and other integer and floating point 
arithmetic operations on symbol streams of varying widths, 
while sustaining the full data throughput of the memory 
hierarchy. The residue of logic blocks that perform operations 
that are neither arithmetic nor permutation group oriented 
can be replaced with an extended math unit that 
supports additional arithmetic operations such as finite field, 
ring, and table lookup, while also sustaining the full data 
throughput of the memory hierarchy. 
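
The idea of an arithmetic unit that is partitioned under software control can be illustrated with a small software model. The sketch below is only an assumption-laden illustration, not the circuit described here: the function name `partitioned_add`, the default 128-bit word width, and the wrap-around behavior on overflow are all hypothetical choices made for the example. It adds two wide words lane by lane, so that no carry crosses from one symbol into the next.

```python
def partitioned_add(a: int, b: int, elem_bits: int, word_bits: int = 128) -> int:
    """Add two word_bits-wide values lane by lane.

    Each lane is elem_bits wide. A carry out of a lane is discarded
    (wrap-around), so it never spills into the neighboring lane.
    """
    if elem_bits < 1 or word_bits % elem_bits != 0:
        raise ValueError("element width must evenly divide the word width")
    mask = (1 << elem_bits) - 1
    result = 0
    for shift in range(0, word_bits, elem_bits):
        lane_a = (a >> shift) & mask
        lane_b = (b >> shift) & mask
        # Mask the sum so the carry-out of this lane is dropped.
        result |= ((lane_a + lane_b) & mask) << shift
    return result


if __name__ == "__main__":
    a, b = 0x00FF80FF, 0x00018001
    # The same operand bits give different sums depending on the
    # element width selected by software.
    for width in (8, 16, 32):
        print(width, f"{partitioned_add(a, b, width):08x}")
```

With these operands the 8-bit, 16-bit, and 32-bit partitions produce `00000000`, `01000100`, and `01010100` respectively, which shows how the chosen element width decides where carries are allowed to propagate.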

The above multi-precision arithmetic, permutation 
switch, and extended math operations can then be organized 
as machine instructions that transfer their operands to and 
from a single wide multi-ported register file. These instructions 
can be further supplemented with load/store instructions 
that transfer register data to and from a data buffer/cache 
static random access memory ("SRAM") and main 
memory dynamic random access memories ("DRAMs"), 
and with branch instructions that control the flow of instructions 
executed from an instruction buffer/cache SRAM. 
Extensions to the load/store instructions can be made for 
synchronization, and to branch instructions for protected 
gateways, so that multiple threads of execution for audio, 
video, radio, encryption, networking, etc. can efficiently and 
securely share memory and logic resources of a unified 
machine operating near the optimum speed-area point of the 
target semiconductor process. The data path for such a 
unified media processor can interface to a high speed 
input/output ("I/O") subsystem that moves media streams 
across ultra-high bandwidth interfaces to external storage 
and I/O. 

Such a device would incorporate all of the processing 
capabilities of the specialized multi-ASIC combination into 
a single, unified processing device. The unified processor 
would be agile and capable of reprogramming through the 
transmission of new programs over the communication 
medium. This programmable, general purpose device is thus 
less costly than the specialized processor combination, 
easier to operate and reprogram and can be installed or 
applied in many differing devices and situations. The device 
may also be scalable to communications applications that 
support vast numbers of users through massively parallel 
distributed computing. 

It is therefore an object of this invention to process media 
data streams by executing operations at very high bandwidth 
rates. 

It is also an object of this invention to unify the audio, 
video, radio, graphics, encryption, authentication, and net- 
working protocols into a single instruction stream. 

It is also an object of this invention to achieve high 
bandwidth rates in a unified processor that is easy to 
program and more flexible than a heterogeneous combina- 
tion of special purpose processors. 

It is a further object of the invention to support high level 
mathematical processing in a unified media processor, 
including finite group, finite field, finite ring and table 
look-up operations, all at high bandwidth rates. 

It is yet a further object of the invention to provide a 
unified media processor that can be replicated into a multi- 
processor system to support a vast array of users. 

It is yet another object of this invention to allow for 
massively parallel systems within the switching fabric to 
support very large numbers of subscribers and services. 

It is also an object of the invention to provide a general 
purpose programmable processor that could be employed at 
all points in a network. 

It is a further object of this invention to sustain very high 
bandwidth rates to arbitrarily large memory and input/output 
systems. 

SUMMARY OF THE INVENTION 

In view of the above, there is provided a system for media 
processing that maintains substantially peak data throughput 
in the execution and transmission of multiple media data 
streams. The system includes in one aspect a general 
purpose, programmable media processor, and in another 
aspect includes a method for receiving, processing and 
transmitting media data streams. The general purpose, programmable 
media processor of the invention further 
includes an execution unit and a high bandwidth external 
interface, and can be employed in a parallel multi-processor 
system. 

According to the apparatus of the invention, an execution 
unit is provided that maintains substantially peak data 
throughput in the unified execution of multiple media data 
streams. The execution unit includes a data path, and a 
multi-precision arithmetic unit coupled to the data path and 
capable of dynamic partitioning based on the elemental 
width of data received from the data path. The execution unit 
also includes a switch coupled to the data path that is 
programmable to manipulate data received from the data 
path and provide data streams to the data path. An extended 
mathematical element is also provided, which is coupled to 
the data path and programmable to implement additional 
mathematical operations at substantially peak data throughput. 
In a preferred embodiment of the execution unit, at least 
one register file is coupled to the data path. 

According to another aspect of the invention, a general 
purpose programmable media processor is provided having 
an instruction path and a data path to digitally process a 
plurality of media data streams. The media processor 
includes a high bandwidth external interface operable to 
receive a plurality of data of various sizes from an external 
source and communicate the received data over the data path 
at a rate that maintains substantially peak operation of the 
media processor. At least one register file is included, which 
is configurable to receive and store data from the data path 
and to communicate the stored data to the data path. A 
multi-precision execution unit is coupled to the data path 
and is dynamically configurable to partition data received 
from the data path to account for the elemental symbol size 
of the plurality of media streams, and is programmable to 
operate on the data to generate a unified symbol output to the 
data path. 

According to the preferred embodiment of the media 
processor, means are included for moving data between 
registers and memory by performing load and store 
operations, and for coordinating the sharing of data among 
a plurality of tasks by performing synchronization operations 
based upon instructions and data received by the 
execution unit. Means are also provided for securely controlling 
the sequence of execution by performing branch and 
gateway operations based upon instructions and data 
received by the execution unit. A memory management unit 
operable to retrieve data and instructions for timely and 
secure communication over the data path and instruction 
path respectively is also preferably included in the media 
processor. The preferred embodiment also includes a combined 
instruction cache and buffer that is dynamically allocated 
between cache space and buffer space to ensure 
real-time execution of multiple media instruction streams, 
and a combined data cache and buffer that is dynamically 
allocated between cache space and buffer space to ensure 
real-time response for multiple media data streams. 

In another aspect of the invention, a high bandwidth 
processor interface for receiving and transmitting a media 
stream is provided having a data path operable to transmit 
media information at sustained peak rates. The high bandwidth 
processor interface includes a plurality of memory 
controllers coupled in series to communicate stored media 
information to and from the data path, and a plurality of 
memory elements coupled in parallel to each of the plurality 
of memory controllers for storing and retrieving the media 
information. In the preferred embodiment of the high bandwidth 
processor interface, the plurality of memory controllers 
each comprise a paired link disposed between each 
memory controller, where the paired links each transmit and 
receive plural bits of data and have differential data inputs 
and outputs and a differential clock signal. 

Yet another aspect of the invention includes a system for 
unified media processing having a plurality of general 
purpose media processors, where each media processor is 
operable at substantially peak data rates and has a dynamically 
partitioned execution unit and a high bandwidth interface 
for communicating to memory and input/output elements 
to supply data to the media processor at substantially 
peak rates. A bi-directional communication fabric is 
provided, to which the plurality of media processors are 
coupled, to transmit and receive at least one media stream 
comprising presentation, transmission, and storage media 
information. The bi-directional communication fabric preferably 
comprises a fiber optic network, and a subset of the 
plurality of media processors comprise network servers. 

According to yet another aspect of the invention, a 
parallel multi-media processor system is provided having a 
data path and a high bandwidth external interface coupled to 
the data path and operable to receive a plurality of data of 
various sizes from an external source and communicate the 
received data at a rate that maintains substantially peak 
operation of the parallel multi-processor system. A plurality 
of register files, each having at least one register coupled to 
the data path and operable to store data, are also included. At 
least one multiprecision execution unit is coupled to the data 
path and is dynamically configurable to partition data 
received from the data path to account for the elemental 
symbol size of the plurality of media streams, and is 
programmable to operate in parallel on data stored in the 
plurality of register files to generate a unified symbol output 
for each register file. 

According to the method of the invention, unified streams 
of media data are processed by receiving a stream of unified 
media data including presentation, transmission and storage 
information. The unified stream of media data is dynami- 
cally partitioned into component fields of at least one bit 
based on the elemental symbol size of data received. The 
unified stream of media data is then processed at substan- 
tially peak operation. 
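
The partitioning step of the method can be pictured as slicing a wide word into equal component fields whose width matches the elemental symbol size, down to a single bit. The following is a minimal sketch under stated assumptions: the helper names `split_fields` and `join_fields`, the default 128-bit word width, and the least-significant-field-first ordering are hypothetical and are not taken from the specification.

```python
from typing import List


def split_fields(word: int, elem_bits: int, word_bits: int = 128) -> List[int]:
    """Slice a word into fields of elem_bits each, least significant first."""
    if elem_bits < 1 or word_bits % elem_bits != 0:
        raise ValueError("field width must be >= 1 and divide the word width")
    mask = (1 << elem_bits) - 1
    return [(word >> shift) & mask for shift in range(0, word_bits, elem_bits)]


def join_fields(fields: List[int], elem_bits: int) -> int:
    """Reassemble fields (least significant first) into a single word."""
    mask = (1 << elem_bits) - 1
    word = 0
    for index, field in enumerate(fields):
        word |= (field & mask) << (index * elem_bits)
    return word


if __name__ == "__main__":
    sample = 0xABCD
    # The same 16 bits viewed as 4-bit, 8-bit, or 1-bit symbols.
    for width in (4, 8, 1):
        print(width, split_fields(sample, width, word_bits=16))
```

Splitting and then joining with the same field width returns the original word, so the partition changes only how the bits are interpreted, not the data itself.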

In one aspect of the invention, the unified stream of media 
data is processed by storing the stream of unified media data 
in a general register file. Multi-precision arithmetic operations 
can then be performed on the stored stream of unified 
media data based on programmed instructions, where the 
multi-precision arithmetic operations include Boolean, integer 
and floating point mathematical operations. The component 
fields of unified media data can then be manipulated 
based on programmed instructions that implement copying, 
shifting and re-sizing operations. Multi-precision mathematical 
operations can also be performed on the stored 
stream of unified media data based on programmed 
instructions, where the mathematical operations include 
finite group, finite field, finite ring and table look-up operations. 
Instruction and data pre-fetching are included to fill 
instruction and data pipelines, and memory management 
operations can be performed to retrieve instructions and data 
from external memory. The instructions and data are preferably 
stored in instruction and data cache/buffers, in which 
buffer storage in the instruction and data cache/buffers is 
dynamically allocated to ensure real-time execution. 

Other aspects of the invention include a method for 
achieving high bandwidth communications between a general 
purpose media processor and external devices by providing 
a high bandwidth interface disposed between the 
media processor and the external devices, in which the high 
bandwidth interface comprises at least one uni-directional 
channel pair having an input port and an output port. A 
plurality of media data streams, comprising component 
fields of various sizes, are transmitted and received between 
the media processor and the external devices at a rate that 
sustains substantially peak data throughput at the media 
processor. A method for processing streams of media data is 
also included that provides a bi-directional communications 
fabric for transmitting and receiving at least one stream of 
media data, where the at least one stream of media data 
comprises presentation, transmission and storage information. 
At least one programmable media processor is provided 
within the communications network for receiving, processing 
and transmitting the at least one stream of unified media 
data over the bi-directional communications fabric. 

The general purpose, programmable media processor of 
the invention combines in a single device all of the necessary 
hardware included in the specialized processor combinations 
to process and communicate digital media data streams 
in real-time. The general purpose, programmable media 
processor is therefore cheaper and more flexible than the 
prior approach to media processing. The general purpose, 
programmable media processor is thus more susceptible to 
incorporation within a massively parallel processing network 
of general purpose media processors that enhance the 
ability to provide real-time multi-media communications to 
the masses. 

These features are accomplished by deploying server 
media processors and client media processors throughout the 
network. Such a network provides a seamless, global media 
super-computer which allows programmers and network 
owners to virtualize resources. Rather than restrictively 
accessing only the memory space and processing time of a 
local resource, the system allows access to resources 
throughout the network. In small access points such as 
wireless devices, where very little memory and processing 
logic is available due to limited battery life, the system is 
able to draw upon the resources of a homogeneous multi-computer 
system. 

The invention also allows network owners the facility to 
track standards and to deploy new services by broadcasting 
software across the network rather than by instituting costly 
hardware upgrades across the whole network. Broadcasting 
software across the network can be performed at the end of 
an advertisement or other program that is broadcast 
nationally. Thus, services can be advertised and then transmitted 
to new subscribers at the end of the advertisement. 

These and other features and advantages of the invention 
will be apparent upon consideration of the following 
detailed description of the presently preferred embodiments 
of the invention, taken in conjunction with the appended 
drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of a broad band media computer 
employing the general purpose, programmable media processor 
of the invention; 

FIG. 2 is a block diagram of a global media processor 
employing multiple general purpose media processors 
according to the invention; 

FIG. 3 is an illustration of the digital bandwidth spectrum 
for telecommunications, media and computing communications; 

FIG. 4 is the digital bandwidth spectrum shown in FIG. 3 
taking into account the bandwidth overhead associated with 
compressed video techniques; 

FIG. 5 is a block diagram of the current specialized 
processor solution for mass media communication, where 
FIG. 5(a) shows the current distributed system, and FIG. 
5(b) shows a possible integrated approach; 

FIG. 6 is a block diagram of two presently preferred 
general purpose media processors, where FIG. 6(a) shows a 
distributed system and FIG. 6(b) shows an integrated media 
processor; 

FIG. 7 is a block diagram of the presently preferred 
structure of a general purpose, programmable media pro- 
cessor according to the invention; 

FIG. 8 is a drawing consisting of visual illustrations of the 
various group operations provided on the media processor, 
where FIG. 8(a) illustrates the group expand operation, FIG. 
8(b) illustrates the group compress or extract operation, FIG. 
8(c) illustrates the group deal and shuffle operations, FIG. 
8(d) illustrates the group swizzle operation and FIG. 8(e) 
illustrates the various group permute operations; 

FIG. 9 shows the preferred instruction and data sizes for 
the general purpose, programmable media processor, where 
FIG. 9(a) is an illustration of the various instruction formats 
available on the general purpose, programmable media 
processor, FIG. 9(b) illustrates the various floating-point 
data sizes available on the general purpose media processor, 
and FIG. 9(c) illustrates the various fixed-point data sizes 
available on the general purpose media processor; 

FIG. 10 is an illustration of a presently preferred memory 
management unit included in the general purpose processor 
shown in FIG. 7, where FIG. 10(a) is a translation block 
diagram and FIG. 10(b) illustrates the functional blocks of 
the translation lookaside buffer; 

FIG. 11 is an illustration of a super-string pipeline tech- 
nique; 

FIG. 12 is an illustration of the presently preferred super- 
spring pipeline technique; 

FIG. 13 is a block diagram of a single memory channel for 
communication to the general purpose media processor 
shown in FIG. 7; 

FIG. 14 is an illustration of the presently preferred connection 
of standard memory devices to the preferred 
memory interface; 

FIG. 15 is a block diagram of the input/output controller 
for use with the memory channel shown in FIG. 13; 

FIG. 16 is a block diagram showing multiple memory 
channels connected to the general purpose media processor 
shown in FIG. 7, where FIG. 16(a) shows a two-channel 
implementation and FIG. 16(b) illustrates a twelve-channel 
embodiment; 

FIG. 17 illustrates the presently preferred packet commu- 
nications protocol for use over the memory channel shown 
in FIG. 13; 

FIG. 18 shows a multi-processor configuration employing 
the general purpose media processor shown in FIG. 7, where 
FIG. 18(a) shows a linear processor configuration, FIG. 
18(b) shows a processor ring configuration, and FIG. 18(c) 
shows a two-dimensional processor configuration; and 

FIG. 19 shows a presently preferred multi-chip imple- 
mentation of the general purpose, programmable media 
processor of the invention. 

DETAILED DESCRIPTION OF THE 
PRESENTLY PREFERRED EMBODIMENT 

Referring to the drawings, where like reference numerals 
refer to like elements throughout, a broad band microcomputer 
10 is provided in FIG. 1. The broad band microcomputer 
10 consists essentially of a general purpose media 
processor 12. As will be described in more detail below, the 
general purpose media processor 12 receives, processes and 
transmits media data streams in a bi-directional manner from 
upstream network components to downstream devices. In 
general, media data streams received from upstream network 
components can comprise any combination of audio, 
video, radio, graphics, encryption, authentication, and networking 
information. As those skilled in the art will 
appreciate, however, the general purpose media processor 
12 is in no way limited to receiving, processing and transmitting 
only these types of media information. The general 
purpose media processor 12 of the invention is capable of 
processing any form of digital media information without 
departing from the spirit and essential scope of the invention. 

System Configuration 

In the preferred embodiment of the invention shown in 
FIG. 1, media data streams are communicated to the media 
processor 12 from several sources. Ideally, unified media 
data streams are received and transmitted by the general 
purpose media processor 12 over a fiber optic cable network 
14. As will be described in more detail below, although a 
fiber optic cable network is preferred, the presently existing 
communications network in the United States consists of a 
combination of fiber optic cable, coaxial cable and other 
transmission media. Consequently, the general purpose 
media processor 12 can also receive and transmit media data 
streams over coaxial cable 14 and traditional twisted pair 
wire connections 16. The specific communications protocol 
employed over the twisted pair 16, whether POTS, ISDN or 
ADSL, is not essential; all protocols are supported by the 
broad band microcomputer 10. The details of these protocols 
are generally known to those skilled in the art and no further 
discussion is therefore needed or provided herein. 

Another form of upstream network communication is 
through a satellite link 18. The satellite link 18 is typically 
connected to a satellite receiver 20. The satellite receiver 20 
comprises an antenna, usually in the form of a satellite dish, 
and amplification circuitry. The details of such satellite 
communications are also generally known in the art, and 
further detail is therefore not provided or included herein. 

As described above, the general purpose media processor 
12 communicates in a bi-directional manner to receive, 
process and transmit media data streams to and from downstream 
devices. As shown in FIG. 1, downstream communication 
preferably takes place in at least two forms. First, 
media data streams can be communicated over a 
bi-directional local network 22. Various types of local networks 
22 are generally known in the art and many different 
forms exist. The general purpose media processor 12 is 
capable of communicating over any of these local networks 
22 and the particular type of network selected is implementation 
specific. 

The local network 22 is preferably employed to communicate 
between the unified processor 12, and audiovisual 
devices 24 or other digital devices 26. Presently preferred 
examples of audiovisual devices 24 include digital cable 
television, video-on-demand devices, electronic yellow 
pages services, integrated message systems, video 
telephones, video games and electronic program guides. As 
those skilled in the art will appreciate, other forms of 
audio/video devices are contemplated within the spirit and 
scope of the invention. Presently preferred embodiments of 
other digital devices 26 for communication with the general 
purpose media processor 12 include personal computers, 
television sets, work stations, digital video camera 
recorders, and compact disc read-only memories. As those 
skilled in the art will also appreciate, further digital devices 
26 are contemplated for communication to the general 
purpose media processor 12 without departing from the 
spirit and scope of the invention. 

Second, the general purpose media processor preferably 
also communicates with downstream devices over a wireless 
network 28. In the presently preferred embodiment of the 
invention, wireless devices for communication over the 
wireless network 28 can comprise either remote communi- 
cation devices 30 or remote computing devices 32. Presently 
preferred embodiments of the remote communications 
devices 30 include cordless telephones and personal com- 
municators. Presently preferred embodiments of the remote 
computing devices 32 include remote controls and telecom- 
municating devices. As those skilled in the art will 
appreciate, other forms of remote communication devices 30 
and remote computing devices 32 are capable of communi- 
cation with the general purpose media processor 12 without 
departing from the spirit and scope of the invention. An agile 
digital radio (not shown) that incorporates a general purpose 
media processor 12 may be used to communicate with these 
wireless devices. 

Network Configuration 

Referring now to FIG. 2, the general purpose media 
processor 12 is preferably disposed throughout a digital 
communications network 38. In order to enable communi- 
cation among large and small businesses, residential cus- 
tomers and mobile users, the network 38 can consist of a 
combination of many individual sub-networks comprised of 
three main forms of interconnection. The trunk and main 
branches of the network 38 preferably employ fiber optic 
cable 40 as the preferred means of interconnection. Fiber 
optic cable 40 is used to connect between general purpose 
media processors 12 disposed as network servers 46 or large 
business installations 48 that are capable of coupling directly 
to the fiber optic link 40. For communications to small 
business and residential customers that may be incapable of 
directly coupling to the fiber optic cable 40, a general 
purpose media processor 12 can be used as an interface to 
other forms of network interconnection. 

As shown in FIG. 2, alternate forms of interconnection 
consist of coaxial cable lines 42 and twisted pair wiring 44. 
Coaxial cable lines are currently in place throughout the 
U.S. and are typically employed to provide cable television 
services to residential homes. According to the preferred 
embodiment of the invention, general purpose media processors 
12 can be installed at these residential locations 52. 
In contrast to the specialized processor approach, the general 
purpose media processor 12 provides enough bandwidth to 
allow for bi-directional communications to and from these 
residential locations 52. 

Network servers 46 controlled by general purpose media 
processors 12 are also employed throughout the network 38. 
For example, the network servers 46 can be used to interface 
between the fiber optic network 40 and twisted pair wiring 
44. Twisted pair wiring 44 is still employed for small 
businesses 50 and residential locations 52 that do not or 
cannot currently subscribe to coaxial cable or fiber optic 
network services. General purpose media processors 12 are 
also disposed at these small business locations 50 and 
non-cable residential locations 52. General purpose media 
processors 12 are also installed in wireless or mobile loca- 
tions 52, which are coupled to the network 38 through agile 
digital radios (not shown). As shown in FIG. 2, network 
databases or other peripherals 56 can also be coupled to general 
purpose media processors 12 in the network 38. 

The general purpose media processor 12 is operable at 
significantly high bandwidths in order to receive, process 
and transmit unified media data streams. Referring to FIG. 
3, the respective frequencies for various types of media data 
streams are set forth against a bandwidth spectrum 60. The 
bandwidth spectrum 60 includes three component 
spectrums, all along the same range of frequencies, which 
represent the various frequency rates of digital media communications. 
Current computing bandwidth capabilities are 
also displayed. The telecommunications spectrum 62 shows 
the various frequency bands used for telecommunications 
transmission. For example, teletype terminals and modems 
operate in a range between approximately 64 bits/second to 
16 kilobits/second. The ISDN telecommunication protocol 
operates at 64 kilobits/second. At the upper end of the 
telecommunications spectrum 62, T1 and T3 trunks operate 
at one megabit per second and 32 megabits per second, 
respectively. The SONET frequency range extends from 
approximately 128 megabits per second up to approximately 
32 gigabits per second. Accordingly, in order to carry such 
broad band communications, the general purpose media 
processor 12 is capable of transferring information at rates 
into the gigabits per second range or higher. 

A spectrum of typical media data streams is presented in
the media spectrum 64 shown in FIG. 3. Voice and music
transmissions are centered at frequencies of approximately
64 kilobits per second and one megabit per second,
respectively. At the upper end of the media spectrum 64, video
transmission takes place in a range from 128 megabits per
second for high density television up to over 256 gigabits per
second for movie applications. When using common video
compression techniques, however, the video transmission
spectrum can be shifted down to between 32 kilobits per
second and 128 megabits per second as a result of the data
compression. As described below, the processing required to
achieve the data compression results in an increase in
bandwidth requirements.

Current computing bandwidths are shown in the computing
spectrum 66 of FIG. 3. Serial communications presently
take place in a range from two kilobits per second up to
512 kilobits per second. The Ethernet network protocol
operates at approximately 8 megabits per second. Current
dynamic random access memory and other digital input/output
peripherals operate between 32 megabits per second
and 512 megabits per second. Presently available
microprocessors are capable of operation in the low gigabits per
second range. For example, the x86 Pentium microprocessor
manufactured by Intel Corporation of Santa Clara, Calif.,
operates in the lower half of that range, and the Alpha
microprocessor manufactured by Digital Equipment
Corporation approaches the 16 gigabits per second range.

When video compression is employed, as expressed
above, the associated processing overhead reduces the
effective bandwidth of the particular processor. As a result, in
order to handle compressed video, these processors must
operate in the terahertz frequency range. The bandwidth
spectrum 60 shown in FIG. 4 represents the effect of
handling media data streams including compressed video.
The computing spectrum 66 is skewed down to properly
align the computing bandwidth requirements with the
telecommunications spectrum 62 and the media spectrum 64.
Accordingly, current processor technology is not sufficient
to handle the transmission and processing associated with
complex streams of multi-media data.

The current specialized processor approach to media
processing is illustrated in the block diagram shown in FIG.
5. As shown in FIG. 5, special purpose processors are
coupled to a back plane 70, which is capable of transmitting
instructions and data at the upper kilobits to lower gigabits 
per second range. In a typical configuration, an audio 
processor 76, video processor 78, graphics processor 80 and 
network processor 82 are all coupled to the back plane 70. 
Each of the audio, video, graphics and network processors 
76-82 typically employ their own private or dedicated 
memories 84, which are only accessible to the specific 
processor and not accessible over the back plane 70. As 
described above, however, unless video data streams are 
constantly being processed, for example, the video processor 
78 will sit idle for periods of time. The computing power of 
the dedicated video processor 78 is thus only available to 
handle video data streams and is not available to handle 
other media data streams that are directed to other dedicated 
processors. This, of course, is an inefficient use of the video 
processor 78 particularly in view of the overall processing 
capability of this multi-processor system. 

The general purpose media processor 12, in contrast, 
handles a data stream of audio, video, graphics and network 
information all at the same time with the same processor. In 
order to handle the ever changing combination of data types, 
the general purpose media processor 12 is dynamically 
partitionable to allocate the appropriate amount of process- 
ing for each combination of media in a unified media data 
stream. A block diagram of two preferred general purpose 
media processor system configurations is shown in FIG. 6. 
Referring to FIG. 6(a), a general purpose media processor 
12 is coupled to a high-speed back plane 90. The presently 
preferred back plane 90 is capable of operation at 30 gigabits
per second. As those skilled in the art will appreciate, back 
planes 90 that are capable of operation at 400 gigabits per 
second or greater bandwidth are envisioned within the spirit 
and scope of the invention. Multiple memory devices 92 are 
also coupled to the back plane 90, which are accessible by
the general purpose media processor 12. Input/output
devices 94 are coupled to the back plane 90 through a
dual-ported memory 92. The configuration of the input/ 
output devices 94 on one end of the dual-ported memory 92 
allows the sharing of these memory devices 92 throughout 
a network 38 of general purpose media processors 12. 

Alternatively, FIG. 6(b) shows a presently preferred inte- 
grated general purpose media processor 12. The integrated 
processor includes on-board memory and I/O 86. The
on-board memory is preferably of sufficient size to optimize
throughput, and can comprise a cache and/or buffer memory
or the like. The integrated media processor 12 also connects
to external memory 88, which is preferably larger than the
on-board memory 86 and forms the system main memory.

Execution Unit

One presently preferred embodiment of an integrated 
general purpose media processor 12 is shown in FIG. 7. The 
core of the integrated general purpose media processor 12 
comprises an execution unit 100. Three main elements or
subsections are included in the execution unit 100. A mul- 
tiple precision arithmetic/logic unit ("ALU") 102 performs 
all logical and simple arithmetic operations on incoming 
media data streams. Such operations consist of calculate and 
control operations such as Boolean functions, as well as 
addition, subtraction, multiplication and division. These 
operations are performed on single or unified media data 
streams transmitted to and from the multiple precision ALU 
102 over a data bus or data path 108. Preferably the data path 
108 is 128 bits wide, although those skilled in the art will 
appreciate that the data path 108 can take on any width or 
size without departing from the spirit and scope of the 
invention. The wider the data path 108, the more unified
media data can be processed in parallel by the general
purpose media processor 12.

Coupled to the multi-precision ALU 102 via the data path
108, and also an element of the execution unit 100, is a
programmable switch 104. The programmable switch 104
performs data handling operations on single or unified media
data streams transmitted over the data path 108. Examples of
such data handling operations include deals, shuffles, shifts,
expands, compresses, swizzles, permutes and reverses,
although other data handling operations are contemplated.
These operations can be performed on single bits or bit fields
consisting of two or more bits up to the entire width of the
data path 108. Thus, single bits or bit fields of various sizes
can be manipulated through programmable operation of the
switch 104.

Examples of the presently preferred data manipulation
operations performed by the general purpose media
processor 12 are shown in FIG. 8. A group expand operation is
visually illustrated in FIG. 8(a). According to the group
expand operation, a sequential field of bits 270 can be
divided into constituent sub-fields 272a-272d for insertion
into a larger field array 274. The reverse of the group expand
operation is a group compress or extract operation. A visual
illustration of the group compress or extract operation is
shown in FIG. 8(b). As shown, separate sub-fields
272a-272d from a larger bit field 274 can be combined to
form a contiguous or sequential field of bits 270.
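
The expand and compress operations described above can be
sketched in software. The following Python fragment is only an
illustration under stated assumptions: the names group_expand and
group_compress, the least-significant-first ordering of sub-fields
and the zero fill of unused slot bits are hypothetical choices made
here, since FIG. 8 is not reproduced in this text.

```python
def group_expand(packed: int, count: int, field_width: int, slot_width: int) -> int:
    """Spread `count` contiguous sub-fields into wider slots.

    Assumptions (not taken from the patent): sub-field 0 is the least
    significant one, and the extra bits of every slot are zero filled.
    """
    if slot_width < field_width:
        raise ValueError("slots must be at least as wide as the sub-fields")
    field_mask = (1 << field_width) - 1
    spread = 0
    for i in range(count):
        field = (packed >> (i * field_width)) & field_mask
        spread |= field << (i * slot_width)
    return spread


def group_compress(spread: int, count: int, slot_width: int, field_width: int) -> int:
    """Reverse of group_expand: pack the low bits of each slot contiguously."""
    if slot_width < field_width:
        raise ValueError("slots must be at least as wide as the sub-fields")
    field_mask = (1 << field_width) - 1
    packed = 0
    for i in range(count):
        field = (spread >> (i * slot_width)) & field_mask
        packed |= field << (i * field_width)
    return packed
```

For example, expanding the 16-bit value 0xABCD as four 4-bit
sub-fields into 8-bit slots gives 0x0A0B0C0D, and compressing that
result with the same widths gives 0xABCD again.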

Referring to FIGS. 8(c)-8(e), group deal, shuffle, swizzle
and permute operations performed by the programmable
switch 104 are also illustrated. The operations performed by
these instructions are readily understood from a review of
the drawings. The group manipulation operations illustrated
in FIGS. 8(a)-8(e) comprise the presently contemplated data
manipulation operations for the general purpose media
processor 12. As those skilled in the art will appreciate, either
a subset of these operations or additional data manipulation
operations can be incorporated in other alternate
embodiments of the general purpose media processor 12 without
departing from the spirit and scope of the invention.

Referring again to FIG. 7, higher level mathematical
operations than those performed by the multi-precision ALU
102 are performed in the general purpose media processor
12 through an extended math element 106. The extended
math element 106 is coupled to the data path 108 and also
comprises part of the execution unit 100. The extended math
element 106 performs the complex arithmetic operations
necessary for video data compression and similarly intensive
mathematical operations. One presently preferred example
of an extended math operation comprises a Galois field
operation. Other examples of extended mathematical
functions performed by the extended math element 106 include
CRC generation and checking, Reed-Solomon code generation
and checking, and spread-spectrum encoding and
decoding. As those skilled in the art appreciate, additional
mathematical operations are possible and contemplated.
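
As an illustration of the kind of Galois field operation mentioned
above, the following sketch multiplies two elements of GF(2^8) by
shift-and-XOR. The field size, the reduction polynomial 0x11B and
the name gf_mul are assumptions chosen for the example; the text
does not state which field or polynomial the extended math element
106 uses.

```python
def gf_mul(a: int, b: int, poly: int = 0x11B, width: int = 8) -> int:
    """Multiply a and b in GF(2^width), reducing by `poly`.

    Assumption: poly = 0x11B (x^8 + x^4 + x^3 + x + 1) and width = 8
    are example parameters only, not values given by the patent.
    """
    if not (0 <= a < (1 << width) and 0 <= b < (1 << width)):
        raise ValueError("operands must fit in the field width")
    overflow_bit = 1 << width
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^n) is XOR
        b >>= 1
        a <<= 1                  # multiply a by x
        if a & overflow_bit:
            a ^= poly            # reduce modulo the field polynomial
    return result
```

With these assumed parameters, gf_mul(0x57, 0x83) evaluates to
0xC1 and gf_mul(0x57, 0x13) evaluates to 0xFE.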

According to the preferred embodiment of the integrated
general purpose media processor 12, a register file 110 is
provided in addition to the execution unit 100 to process
media data. The register file 110 stores and transmits data
streams to and from the execution unit 100 via the data path
108. Rather than employing a complex set of specific or
dedicated registers, the general purpose media processor 12
preferably includes 64 general purpose registers in the
register file 110 along with one program counter (not
shown). The 64 general purpose registers contained in the
register file 110 are all available to the user/programmer, and
comprise a portion of the user state of the general purpose
media processor 12. The general purpose registers are
preferably capable of storing any form of data. Each register
within the register file 110 is coupled to the data path 108
and is accessible to the execution unit 100 in the same
manner. Thus, the user can employ a general purpose
register according to the specific needs of a particular
program or unique application. As those skilled in the art
will appreciate, the register file 110 can also comprise a
plurality of register files 110 configured in parallel in order
to support parallel multi-threaded processing.

Instruction Set and User Programming

Control or manipulation of data processed by the general
purpose media processor 12 is achieved by selected
instructions programmed by the user. Those skilled in the art will
appreciate that a great number of programs are possible
through various sequences of instructions. Particular
programs can be developed for each unique implementation of
the general purpose media processor 12. A detailed
discussion of such specific programs is therefore beyond the scope
of this description.

One presently preferred instruction set for the general
purpose media processor 12 is included in the Microfiche
Appendix, the contents of which are hereby incorporated
herein by reference. A list of the presently preferred major
operation codes for the general purpose media processor 12
appears below in Table I.

TABLE I

MAJOR OPERATION CODES
(major operation code field values)

MAJOR  0               32              64           96            128      160        192       224
0      ERES            GSHUFFLEI       FMULADD16    GMULADD1      LU16LAI  SAAS64LAI  EADDIO    BFE16
1      ESHUFFLEI4MUX   GSHUFFLEI4MUX   FMULADD32    GMULADD2      LU16BAI  SAAS64BAI  EADDIUO   BFNUE16
2                      GSELECT8        FMULADD64    GMULADD4      LU16LI   SCAS64LAI  ESETIL    BFNUGE16
3      EMDEPI          GMDEPI                       GMULADD8      LU16BI   SCAS64BAI  ESETIGE   BFNUL16
4      EMUX            GMUX            FMULSUB16    GMULADD16     LU32LAI  SMAS64LAI  ESETIE    BFE32
5      E8MUX           G8MUX           FMULSUB32    GMULADD32     LU32BAI  SMAS64BAI  ESETINE   BFNUE32
6      EGFMUL64        GGFMUL8         FMULSUB64    GMULADD64     LU32LI   SMUX64LAI  ESETIUL   BFNUGE32
7      ETRANSPOSE8MUX  GTRANSPOSE8MUX               GEXTRACT128   LU32BI   SMUX64BAI  ESETIUGE  BFNUL32
8                                                                 L16LAI   S16LAI     ESUBIO    BFE64
9      ESWIZZLE        GSWIZZLE                     GUMULADD2     L16BAI   S16BAI     ESUBIUO   BFNUE64
10                     GSWIZZLECOPY                 GUMULADD4     L16LI    S16LI      ESUBIL    BFNUGE64
11                     GSWIZZLESWAP                 GUMULADD8     L16BI    S16BI      ESUBIGE   BFNUL64
12     EDEPI           GDEPI           F.16         GUMULADD16    L32LAI   S32LAI     ESUBIE    BFE128
13     EUDEPI          GUDEPI          F.32         GUMULADD32    L32BAI   S32BAI     ESUBINE   BFNUE128
14     EWTHI           GWTHI           F.64         GUMULADD64    L32LI    S32LI      ESUBIUL   BFNUGE128
15     EUWTHI          GUWTHI                       GUEXTRACT128  L32BI    S32BI      ESUBIUGE  BFNUL128
16                                     GFMULADD16                 L64LAI   S64LAI     EADDI     BANDE
17                                     GFMULADD32   GEXTRACTI16   L64BAI   S64BAI     EXORI     BANDNE
18                                     GFMULADD64   GEXTRACTI32   L64LI    S64LI      EORI      BL/BLZ
19                                     GFMULADD128  GEXTRACTI64   L64BI    S64BI      EANDI     BGE/BGEZ
20                                     GFMULSUB16   G EXTRACT     L128LAI  S128LAI    ESUBI     BE
21                                     GFMULSUB32   .1.54         L128BAI  S128BAI              BNE
22                                     GFMULSUB64   G. EXTRACT    L128LI   S128LI     ENORI     BUL/BGZ
23                                     GFMULSUB128  .1.128        L128BI   S128BI     ENANDI    BUGE/BLEZ
24                                                  G.1           LU8I     S8I                  BGATEI
25                                                  G.2           LUSI
26                                                  G.4
27                                                  G.8
28                     GCOPYI          GF.16        G.16                              ECOPYI    BI
29                                     GF.32        G.32                                        BLINKI
30                                     GF.64        G.64
31                     G.MINOR         GF.128       G.128         L.MINOR  S.MINOR    E.MINOR   B.MINOR

As shown in Table I, the major operation codes are grouped
according to the function performed by the operations. The
operations are thus arranged and listed above according to
the presently preferred operation code number for each
instruction. As many as 255 separate operations are
contemplated for the preferred embodiment of the general
purpose media processor 12. As shown in Table I, however,
not all of the operation codes are presently implemented. As
those skilled in the art will appreciate, alternate schemes for
organizing the operation codes, as well as additional
operation codes for the general purpose media processor 12, are
possible.

The instructions provided in the instruction set for the
general purpose media processor 12 control the transfer,
processing and manipulation of data streams between the
register file 110 and the execution unit 100. The presently
preferred width of the instruction path 112 is 32-bits wide,
organized as four eight-bit bytes ("quadlets"). Those skilled
in the art will appreciate, however, that the instruction path
112 can take on any width without departing from the spirit
and scope of the invention. Preferably, each instruction
within the instruction set is stored or organized in memory
on four-byte boundaries. The presently preferred format for
instructions is shown in FIG. 9(a).

As shown in FIG. 9(a), each of the presently preferred
instruction formats for the general purpose media processor
12 includes a field 280 for the major operation code number
shown in Table I. Based on the type of operation performed,
the remaining bits can provide additional operands according
to the type of addressing employed with the operation.
For example, the remainder of the 32-bit instruction field can
comprise an immediate operand ("imm"), or operands stored
in any of the general registers ("ra", "rb", "rc", and "rd"). In
addition, minor operation codes 282 can also be included
among the operands of certain 32-bit instruction formats.
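
The relationship between a 32-bit instruction word and the major
operation code numbering of Table I (column bases 0 through 224 in
steps of 32, rows 0 through 31) can be sketched as follows. This is
a minimal illustration, assuming that the major operation code field
280 occupies the eight most significant bits of the word; the bit
position, the name split_major and the MajorCode type are
hypothetical and are not specified in the text above.

```python
from typing import NamedTuple


class MajorCode(NamedTuple):
    major: int   # full major operation code, 0..255
    column: int  # Table I column base: 0, 32, 64, ..., 224
    row: int     # Table I row: 0..31


def split_major(instruction: int) -> MajorCode:
    """Split the major operation code out of a 32-bit instruction word.

    Assumption: the major operation code sits in bits 31..24. The
    remaining 24 bits (registers, immediate, minor code) are ignored.
    """
    if not 0 <= instruction <= 0xFFFFFFFF:
        raise ValueError("instruction must fit in 32 bits")
    major = (instruction >> 24) & 0xFF
    return MajorCode(major=major, column=major & 0xE0, row=major & 0x1F)
```

Under this assumption, a word whose top byte is 244 decodes to
column 224, row 20 of Table I.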

The presently preferred embodiment of the general pur- 
pose media processor 12 includes a limited instruction set 
similar to those seen in Reduced Instruction Set Computer
("RISC") systems. The preferred instruction set for the
general purpose media processor 12 shown in Table I 
includes operations which implement load, store, 
synchronize, branch and gateway functions. These five 
groups of operations can be visually represented as two 
general classes of related operations. The branch and gate- 
way operations perform related functions on media data 
streams and are thus visually represented as block 114 in 
FIG. 7. Similarly, the load, store and synchronize operations 
are grouped together in block 116 and perform similar 
operations on the media data streams. (Blocks 114 and 116 
only represent the above classification of these operations 
and their function in the processing of media data streams, 
and do not indicate any specific underlying electronic 
connections.) A more detailed discussion of these 
operations, and the functionality of the general purpose 
media processor 12, appears in the Microfiche Appendix. 

The four-byte structure of instructions for the general 
purpose media processor 12 is preferably independent of the 
byte ordering used for any data structures. Nevertheless, the 
gateway instructions are specifically defined as 16-byte 
structures containing a code address used to securely invoke 
a procedure at a higher privilege level. Gateways are preferably
marked by protection information specified in the
translation lookaside buffer 148 in the memory management 
unit 122. Gateways are thus preferably aligned on 16-byte 
boundaries in the external memory. In addition to the general 
purpose registers and program counter, a privilege level 
register is provided within the register file 110 that contains 
the privilege level of the currently executing instruction. 

The instruction set preferably includes load and store 
instructions that move data between memory and the register 
file 110, branch instructions to compare the content of 
registers and transfer control, and arithmetic operations to 
perform computations on the contents of registers. Swap
instructions provide multi-thread and multi-processor
synchronization. These operations are preferably indivisible and
include such instructions as add-and-swap, compare-and-swap,
and multiplex-and-swap instructions. The fixed-point
compare-and-branch instructions within the instruction set
shown in Table I provide the necessary arithmetic tests for
equality and inequality of signed and unsigned fixed-point
values. The branch through gateway instruction provides a
secure means to access code at a higher privileged level in
a form similar to a high level language procedure call
generally known in the art.

The general purpose media processor 12 also preferably
supports floating-point compare-and-branch instructions.
The arithmetic operations, which are supported in hardware,
include floating-point addition, subtraction, multiplication,
division and square root. The general purpose media
processor 12 preferably supports other floating-point operations
defined by the ANSI-IEEE floating-point standard through
the use of software libraries. A floating point value can
preferably be 16, 32, 64 or 128 bits wide. Examples of the
presently preferred floating-point data sizes are illustrated
in FIG. 9(b).

The general purpose media processor 12 preferably sup- 
ports virtual memory addressing and virtual machine opera- 
tion through a memory management unit 122. Referring to 
FIG. 10(a), one presently preferred embodiment of the
memory management unit 122 is shown. The memory
management unit 122 preferably translates global virtual
addresses into physical addresses by software programmable
routines augmented by a hardware translation lookaside
buffer ("TLB") 148. A facility for local virtual address
translation 164 is also preferably provided. As those skilled
in the art will appreciate, the memory management unit 122
includes a data cache 166 and a tag cache 168 that store data
and tags associated with memory sections for each entry in
the TLB 148.

A block diagram of one preferred embodiment of the TLB
148 is shown in FIG. 10(b). The TLB 148 receives a virtual
address 230 as its input. For each entry in the TLB 148, the
virtual address 230 is logically AND-ed with a mask 232.
The output of each respective AND gate 234 is compared via
a comparator 236 with each entry in the TLB 148. If a match
is detected, an output from the comparator 236 is used to
gate data 240 through a transceiver 238. As those skilled in
the art will appreciate, a match indicates the entry of the
corresponding physical address within the contents of the
TLB 148 and no external memory or I/O access is required.
The data 240 for the data cache 166 (FIG. 10(a)) is then
combined with the remaining lower bits of the virtual
address 230 through an exclusive-OR gate 242. The resultant
combination is the physical address 244 output from the
TLB 148. If a match is not detected between the logical
address and the contents of the tag cache 168, an external
memory or I/O access by the memory management unit 122 is
necessary to retrieve the relevant portion of memory and
update the contents of the TLB 148 accordingly.
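
The mask, compare and exclusive-OR steps described above can be
modeled in a few lines. The following is a sketch only: the names
TlbEntry and tlb_lookup, the per-entry tag field and the sequential
loop (the hardware compares all entries in parallel) are assumptions
made for illustration, and a return value of None stands in for the
miss case that requires an external memory or I/O access.

```python
from typing import NamedTuple, Optional, Sequence

ADDRESS_BITS = 64                       # eight-byte addresses, per the text
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1


class TlbEntry(NamedTuple):
    mask: int  # virtual-address bits that take part in the comparison
    tag: int   # value the masked virtual address must equal for a hit
    data: int  # translation data gated out on a hit


def tlb_lookup(entries: Sequence[TlbEntry], virtual_address: int) -> Optional[int]:
    """Return the physical address on a hit, or None on a miss."""
    for entry in entries:
        if (virtual_address & entry.mask) == entry.tag:
            # Remaining (unmasked) lower bits pass through and are
            # combined with the entry data by exclusive-OR.
            low_bits = virtual_address & ~entry.mask & ADDRESS_MASK
            return entry.data ^ low_bits
    return None
```

For instance, with one entry whose mask is 0xFFFFFFFFFFFFF000, tag
0x401000 and data 0x80002000, the virtual address 0x401ABC maps to
0x80002ABC, while 0x500000 misses.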

Using generally known memory management techniques,
the memory management unit 122 ensures that instructions
(and data) are properly retrieved from external memory (or
other sources) over an external input/output bus 126 (see
FIG. 7). As described in more detail below, a high bandwidth
interface 124 is coupled to the external input/output bus 126
to communicate instructions (and media data streams) to the
general purpose media processor 12. The presently preferred
physical address width for the general purpose media
processor 12 is eight bytes (64-bits). In addition, the memory
management unit 122 preferably provides match bits (not
shown) that allow large memory regions to be assigned a
single TLB entry allowing for fine grain memory management
of large memory sections. The memory management
unit 122 also preferably includes a priority bit (not shown)
that allows for preferential queuing of memory areas according
to respective levels of priority. Other memory management
operations generally known in the art are also performed
by the memory management unit 122.

Referring again to FIG. 7, instructions received by the
general purpose media processor 12 are stored in a combined
instruction buffer/cache 118. The instruction buffer/cache
118 is dynamically subdivided to store the largest
sequence of instructions capable of execution by the execution
unit 100 without the necessity of accessing external
memory. In a preferred embodiment of the invention,
instruction buffer space is allocated to the smallest and most
frequently executed blocks of media instructions. The
instruction buffer thus helps maintain the high bandwidth
capacity of the general purpose media processor 12 by
sustaining the number of instructions executed per second at
or near peak operation. That portion of the instruction
buffer/cache 118 not used as a buffer is, therefore, available
to be used as cache memory. The instruction buffer/cache
118 is coupled to the instruction path 112 and is preferably
32 kilobytes in size.

A data buffer/cache 120 is also provided to store data
transmitted and received to and from the execution unit 100
and register file 110. The data buffer/cache 120 is also
dynamically subdivided in a manner similar to that of the
instruction buffer/cache 118. The buffer portion of the data
buffer/cache 120 is optimized to store a set size of unified
media data capable of execution without the necessity of
accessing external memory. In a preferred embodiment of
the invention, data buffer space is allocated to the smallest
and most frequently accessed working sets of media data.
Like the instruction buffer, the data buffer thus maintains
peak bandwidth of the general purpose media processor 12.
The data buffer/cache 120 is coupled to the data path 108
and is preferably also 32 kilobytes in size.

The preferred embodiment of the general purpose media
processor 12 includes a pipelined instruction pre-fetch
structure. Although pipelined operation is supported, the general
purpose media processor 12 also allows for non-pipelined
operations to execute without any operational penalty. One
preferred pipeline structure for the general purpose media
processor 12 comprises a "super-string" pipeline shown in
FIG. 11. A super-string pipeline is designed to fetch and
execute several instructions in each clock cycle. The
instructions available for the general purpose media processor 12
can be broken down into five basic steps of operation. These
steps include a register-to-register address calculation, a
memory load, a register-to-register data calculation, a
memory store and a branch operation. According to the
super-string pipeline organization of the general purpose
media processor 12, one instruction from each of these five
types may be issued in each clock cycle. The presently
preferred ordering of these operations is as listed above,
where the five steps are assigned the letters "A," "L,"
"E," "S" and "B" (see FIG. 11).

According to the super-string pipelining technique, each 
of the instructions are serially dependent, as shown in FIG. 
11, and the general purpose media processor 12 has the 
ability to issue a string of dependent instructions in a single
clock cycle. These instructions shown in FIG. 11 can take
from two to five cycles of latency to execute, and a branch
prediction mechanism is preferably used to keep the
pipeline filled (described below). Instructions can be
encoded in unit categories such as address, load, store/sync, 
fixed, float and branch to allow for easy decoding. A similar 
scheme is employed to pre-fetch data for the general purpose 
media processor 12. 

As those skilled in the art will appreciate, the super-string
pipeline can be implemented in a multi-threaded
environment. In such an implementation, the number of threads is
preferably relatively prime with respect to functional unit
rates so that functional units can be scheduled in a
non-interfering fashion between each thread.

In another more preferred embodiment, a "super-spring"
pipelining scheme is employed with the general purpose
media processor 12. The super-spring pipeline technique
breaks the super-string pipeline shown in FIG. 11 into two
sections that are coupled via a memory buffer (not shown).
A visual representation of the super-spring pipeline
technique is shown in FIG. 12. The front of the pipeline 204, in
which address calculation (A), memory load (L), and branch
(B) operations are handled, is decoupled from the back of
the pipeline 206, in which data calculation (E) and memory
store (S) operations are handled. The decoupling is
accomplished through the memory buffer (not shown), which is
preferably organized in a first-in-first-out ("FIFO") fast/dense
structure. (The memory buffer is functionally
represented as a spring in FIG. 12.)
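
The decoupling role of the FIFO memory buffer can be sketched with
a bounded queue. The class below is a hypothetical software model,
not the patented circuit: the name SpringBuffer, its method names
and the stall convention (front_push returns False when the buffer
is full, back_pop returns None when it is empty) are assumptions
made for illustration.

```python
from collections import deque
from typing import Any, Optional


class SpringBuffer:
    """Bounded FIFO between the front (A, L, B) and back (E, S) sections."""

    def __init__(self, depth: int) -> None:
        if depth < 1:
            raise ValueError("depth must be at least 1")
        self._depth = depth
        self._queue: deque = deque()

    def front_push(self, item: Any) -> bool:
        """Front section deposits a result; False means the front must stall."""
        if len(self._queue) >= self._depth:
            return False
        self._queue.append(item)
        return True

    def back_pop(self) -> Optional[Any]:
        """Back section takes the oldest entry; None means the back must stall."""
        return self._queue.popleft() if self._queue else None
```

As long as the buffer is neither full nor empty, the two sections
proceed at their own pace, which is the "spring" behavior: the
front can run ahead on loads while the back drains earlier entries.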

As indicated in Table I above, the general purpose media 
processor 12 does not include delayed branch instructions,
and so relies upon branch or fetch prediction techniques to
keep the pipeline full in program flows around unconditional
and conditional branch instructions. Many such techniques
are generally known in the art. Examples of some presently
preferred techniques include the use of group compare and
set, and multiplex operations to eliminate unpredictable
branches; the use of short forward branches, which cause
pipeline neutralization; and the use of branch and link, which
predicts the return address in a stack of one or more entries.
In addition,
the specialized gateway instructions included in the general 
purpose media processor 12 allow for branches to and from 
protected virtual memory space. The gateway instructions, 
therefore, allow an efficient means to transfer between 
various levels of privilege. 
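
The return-address prediction mentioned above, in which branch and
link uses a stack of one or more entries, can be sketched as
follows. This is a minimal model under stated assumptions: the class
name ReturnAddressStack, the default of four entries, the policy of
discarding the oldest entry on overflow, and the return address
being the instruction that follows the branch are all hypothetical
choices; only the four-byte instruction size comes from the text.

```python
from collections import deque
from typing import Optional

INSTRUCTION_BYTES = 4  # instructions are stored on four-byte boundaries


class ReturnAddressStack:
    """Small predictor stack: push on branch-and-link, pop on return."""

    def __init__(self, entries: int = 4) -> None:
        if entries < 1:
            raise ValueError("the stack needs at least one entry")
        # A deque with maxlen silently drops the oldest entry on overflow.
        self._stack: deque = deque(maxlen=entries)

    def on_branch_and_link(self, branch_address: int) -> None:
        """Record the address of the instruction after the branch."""
        self._stack.append(branch_address + INSTRUCTION_BYTES)

    def predict_return(self) -> Optional[int]:
        """Predicted return target, or None when no prediction is available."""
        return self._stack.pop() if self._stack else None
```

With a two-entry stack and nested calls from 0x100, 0x200 and
0x300, the predictor returns 0x304 and then 0x204; the oldest entry
was discarded, so the third return has no prediction.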

As described above, two basic forms of media data are 
processed by the general purpose media processor 12, as 
shown in FIG. 7. These data streams generally comprise 
Nyquist sampled I/O 128, and standard memory and I/O 
130. As shown in FIG. 7, audio 132, video 134, radio 136, 
network 138, tape 140 and disc 142 data streams comprise 
some examples of digitally sampled I/O 128. As those 
skilled in the art will appreciate, other forms of digitally 
sampled I/O are contemplated for processing by the general 
purpose media processor 12 without departing from the 
spirit and scope of the invention. Standard memory and I/O
130 comprises data received and transmitted to and from
general digital peripheral devices used in the design of most
computer systems. As shown in FIG. 7, some examples of
such devices include dynamic random access memory
("DRAM") 146, or any data received over the PCI bus 144
generally known in the art. Other forms of standard memory
and I/O sources are also contemplated. The various fixed-point
data sizes preferred for the general purpose media
processor 12 are illustrated in FIG. 9(c).

External Interface

As mentioned above, the general purpose media processor 
12 includes a high bandwidth interface 124 to communicate 
with external memory and input/output sources. As part of
the high bandwidth interface 124, the general purpose media
processor 12 integrates several fast communication channels
156 (FIG. 13) to communicate externally. These fast
communication channels 156 preferably couple to external
caches 150, which serve as a buffer to memory interfaces
152 coupled to standard memory 154. The caches 150
preferably comprise synchronous static random access
memory ("SRAM"), each of which is sixty-four kilobytes
in size; and the standard memories 154 comprise DRAMs.
The memory interfaces 152 transmit data between the
caches 150 and the standard memories 154. The standard
memories 154 together form the main external memory for
the general purpose media processor 12. The cache 150,
memory interface 152, standard memory 154 and input/output
channel 156 therefore make up a single external
memory unit 158 for the general purpose media processor
12.

According to the presently preferred embodiment of the 
invention, the memory interface protocol embeds read and 
write operations to a single memory space into packets
containing command, address, data and acknowledgment
information. The packets preferably include check codes
that will detect single-bit transmission errors and some
multiple-bit errors. As many as eight operations may be in
progress at a time in each external memory unit 158. As
shown in FIG. 13, up to four external memory units 158 may
be cascaded together to expand the memory available to the
general purpose media processor 12, and to improve the
bandwidth of the external memory. Through such cascaded
memory units 158, the memory interface 152 provides for
the direct connection of multiple banks of standard memory 
154 to maintain operation of the general purpose media 
processor 12 at sustained peak bandwidths. 

According to one embodiment shown in FIG. 13, up to 
four standard memory devices 154 can be coupled to each 
memory interface 152. Each standard memory 154 thus 
includes as many as four banks of DRAM, each of which is 
preferably sixteen bits wide. The standard memories 154 are
connected in parallel to the memory interface 152 forming
a 72-bit wide data bus 160, where 64 bits are preferably
provided for data transfer and eight bits are provided for
error correction. In addition to the data bus 160, an
address/control bus 162 is coupled between the memory
interface 152 and each standard memory 154. The address/control
bus 162 preferably comprises at least twelve address lines (4
kilobits x 16 memory size) and four control lines as shown in
FIG. 13. An alternate manner for coupling the DRAM's to 
the memory interface 152 is illustrated in FIG. 14. As shown 
in FIG. 14, two banks of four DRAM single in-line memory 
modules are coupled in parallel to the memory interface 152. 
The memory interface 152 also supports interleaving to 
enhance bandwidth, and page mode accesses to improve 
latency for localized addressing. 

Using standard DRAM components, the external memory
units 158 achieve bandwidths of approximately two 
gigabits/second with the standard memories 154. When four 
such external memory units 158 are coupled via the com- 
munication channel 156, therefore, the total bandwidth of 
the external main memory system increases to one gigabyte/ 
second. As discussed further below, in implementations with 
two or eight communication channels 156, the aggregate 
bandwidth increases to two and eight gigabytes/second, 
respectively. 
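
The bandwidth figures above follow from simple unit arithmetic: four cascaded units at roughly two gigabits/second each give eight gigabits/second, or one gigabyte/second, per channel. The short Python sketch below reproduces that arithmetic. The constant and function names are illustrative assumptions and do not come from the specification; the two gigabits/second value is the approximate figure quoted in the text.

```python
# Illustrative arithmetic only; all names here are assumptions.
BITS_PER_BYTE = 8
UNIT_GBIT_PER_S = 2.0     # approximate bandwidth of one external memory unit 158
UNITS_PER_CHANNEL = 4     # external memory units cascaded on one channel 156


def channel_gbyte_per_s(units: int = UNITS_PER_CHANNEL) -> float:
    """Bandwidth of one communication channel in gigabytes/second."""
    return units * UNIT_GBIT_PER_S / BITS_PER_BYTE


def aggregate_gbyte_per_s(channels: int) -> float:
    """Aggregate bandwidth when several channels operate in parallel."""
    return channels * channel_gbyte_per_s()
```

With these assumptions, one channel yields one gigabyte/second, and two or eight channels yield two or eight gigabytes/second, matching the figures stated above.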

A more detailed depiction of the communication channel 
156 circuitry appears in FIG. 15. According to the preferred 
embodiment of the invention, each communication channel 
156 comprises two unidirectional, byte-wide, differential,
packet-oriented data channels 156a, 156b (see FIG. 13). As
explained above, where memory units 158 are cascaded 
together in series, the output of one memory unit 158 is 
connected to the input of another memory unit 158. The two 
unidirectional channels are thus connected through the 
memory units 158 forming a loop structure and make up a 
single bi-directional memory interface channel. 

Referring to FIG. 15, each communication channel 156 is 
preferably eight bits wide, and each bit is transmitted 
differentially. For example, output transceiver 170 for bit
D0out transmits both D0 and /D0 signals over the
communication channel 156. Additional transceivers are similarly
provided for the remaining bits in the channel 156. (The
transceiver 176 for bit D7out and associated differential lines
178, 180 are shown in FIG. 15.) A CLKout transceiver 182
is also provided to generate differential clock outputs 184,
186 over the channel 156. To complete the link between
memory units 158, input transceivers 188-192 are provided
in each memory unit 158 for each of the differential bits and
clock signals transmitted over the communication channel
156. These input signals 172, 174, 178, 180, 184, 186 are
preferably transmitted through input buffers 194-198 to
other parts of the memory unit 158 (described above).

Each memory unit 158 also includes a skew calibrator 200 
and phase locked loop ("PLL") 202. The skew calibrator 200 
is used to control skew in signals output to the communi- 
cation channel 156. Preferably, digital skew fields are 
employed, which include set numbers of delay stages to be 
inserted in the output path of the communication channel
156. Setting these fields, and the corresponding analog skew
fields, permits a fine level of control over the relative skew
between output channel signals. 

The PLL 202 recovers the clock signal on either side of
the communication channel 156 and is thus provided to
remove clock jitter. The clock signals 184, 186 preferably
comprise a single phase, constant rate clock signal. The
clock signals 184, 186 thus contain alternating zero and one
values transmitted with the same timing as the data signals
172, 174, 178, 180. The clock signal frequency is, therefore,
one-half the byte data rate. The communication channel 156
preferably operates at constant frequency and contains no
auxiliary control, handshaking or flow control information.
Each external memory unit 158 preferably defines two
functional regions: a memory region, implemented by the
cache 150 backed by standard memory 154 (see FIG. 13),
and a configuration region, implemented by registers (not
shown). The two regions are accessed by separate interfaces;
the communication channel 156 is used to access the
memory region, and a serial interface (described below) is
used to access the configuration region. In the memory
region, the caches 150 are preferably write-back (write-in)
single-set (direct-map) caches for data originally contained
in standard memory 154. All accesses to memory space
should maintain consistency between the contents of the
cache 150 and the contents of the standard memory 154. The
configuration region registers provide the mechanism to
detect and adjust skew in the communication channel 156.
Software is preferably employed to adaptively adjust the
skew in the channel 156 through digital skew fields, as
explained above. The serial interface thus is used to
configure the external memory units 158, set diagnostic modes
and read diagnostic information, and to enable the use of a
high-speed tester (not shown).

One presently preferred embodiment of the invention
employs two byte-wide packet communication channels 156
(FIG. 16(a)). In order to further increase the bandwidth of the
general purpose media processor 12, up to sixteen byte-wide
packet communication channels 156 can be employed.
Referring to FIG. 16(b), twelve communication channels,
comprising eight memory channels 210, a ninth channel for
parallel processing 212 (described below), and three input/output
("I/O") channels 214, are shown. Each of the
communication channels 210-214 preferably employs the
cascade configuration of four channel interface devices 216.
(Each channel interface device 216 coupled to the memory
channels 210 corresponds to the external memory unit 158
shown in FIG. 13.) Through each of the twelve
communication channels shown in FIG. 16(b), the general purpose
media processor 12 can request or issue read or write
transactions. When not interleaved, the twelve channels
provide a single contiguous memory space for each channel
interface device 216.

Alternatively, memory accesses may be interleaved in
order to provide for continuous access to the external
memory system at the maximum bandwidth for the DRAM
memories. In an interleaved configuration, at any point in
time some memory devices will be engaged in row
precharge, while others may be driving or receiving data, or
receiving row or column addresses. The memory interface
152 (FIG. 13) thus preferably maps between a contiguous
address space and each of the separate address spaces made
available within each external memory unit 158. For
maximum performance, therefore, the memory interface is
interleaved so that references to adjacent addresses are handled
by different memory devices. Moreover, in the preferred
embodiment, additional memory operations may be
requested before the corresponding DRAM bank is
available. In an interleaved approach, these operations are placed
in a queue until they can be processed. According to the
preferred embodiment, memory writes have lower priority
than memory reads, unless an attempt is made to read an
address that is queued for a write operation. As those skilled
in the art will appreciate, the depth of the memory write
queue is dictated by the specific implementation.
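
The read-over-write priority rule can be illustrated with a small queue model. The Python sketch below is only one possible reading of the rule: it assumes that when the next read targets an address with a queued write, the oldest queued write to that address is serviced first so the read observes current data. The class and method names are hypothetical and are not taken from the specification.

```python
from collections import deque


class MemoryQueue:
    """Toy model: reads are served before writes unless the next read
    targets an address that still has a write queued (assumed policy)."""

    def __init__(self):
        self.reads = deque()    # pending read addresses, oldest first
        self.writes = deque()   # pending (address, data) pairs, oldest first

    def enqueue_read(self, address):
        self.reads.append(address)

    def enqueue_write(self, address, data):
        self.writes.append((address, data))

    def next_operation(self):
        """Return the next ('read'|'write', address, data) tuple, or None."""
        if self.reads:
            address = self.reads[0]
            # A queued write to the same address must complete first.
            for index, (write_address, data) in enumerate(self.writes):
                if write_address == address:
                    del self.writes[index]
                    return ("write", write_address, data)
            self.reads.popleft()
            return ("read", address, None)
        if self.writes:
            write_address, data = self.writes.popleft()
            return ("write", write_address, data)
        return None
```

In this model a write waits only while unrelated reads are pending, and is promoted as soon as a read depends on it.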

Although up to four external memory units 158 are
preferably cascaded to form effectively larger memories,
some amount of latency may be introduced by the cascade.
Packets of data transmitted over the communication channel
156 are uniquely addressed to a particular channel interface
device 216. A packet received at a particular device, which
specifies another module address, is automatically passed to
the correct channel interface device 216. Unless the module
address matches a particular device 216, that packet simply
passes from the input to the output of the interface device
216. This mechanism divides the serial interconnection of
interface devices 216 into strings, which function as a single
larger memory or peripheral, but with possibly longer
response latency.

In addition to the memory channels 210, the general
purpose media processor 12 provides several communication
channels 214 for communication with external input/output
devices. Referring to FIG. 16(b), three input/output
channels 214 having SRAM buffered memory (see FIG. 13)
provide an interface to external standard I/O devices (not
shown). Like the eight memory channels 210, the three I/O
channels 214 are byte-wide input/output channels intended
to operate at rates of at least one gigahertz. The three I/O
channels 214 also operate as a packet communication link to
synchronous SRAM memory 208 within the channel
interface device 216. A controller 226 within the channel
interface device 216 completes the interface to the I/O devices.

The three I/O channels 214 preferably function in like
manner to the memory channels 210 described above. The
interface protocol for the three I/O channels 214 divides read
and write operations to a single memory space into packets
containing command, address, data and acknowledgment
information. The packets also include a check code that will
detect single-bit transmission errors and some multiple-bit
errors. According to the preferred embodiment of the
invention, as many as eight operations may progress in each
interface device 216 at a time. As shown in FIG. 16(b), up
to four channel interface devices 216 can be cascaded
together to expand the bandwidth in the three I/O channels
214. A bit-serial interface (not shown) is also provided to
each of the channel interface devices 216 to allow access to
configuration, diagnostic and tester information at standard
TTL signal levels at a more moderate data rate. (A more
detailed description of the serial interface is provided
below.)

Like the memory channels 210, each I/O channel 214
includes nine signals: one clock signal and eight data
signals. Differential voltage levels are preferably employed
for each signal. Each channel interface device 216 is
preferably terminated in a nominal 50 ohm impedance to
ground. This impedance applies for both inputs and outputs
to the communication channel 156. A programmable
termination impedance is preferred.
Interface Communication 

According to one presently preferred embodiment of the
invention, the channel interface devices 216 can operate as
either master devices or slave devices. A master device is
capable of generating a request on the communication
channel 156 and receiving responses from the
communication channel 156. Slave devices are capable of receiving
requests and generating responses over the communication
channel 156. A master device is preferably capable of
generating a constant frequency clock signal and accepting
signals at the same clock frequency over the communication
channel 156. A slave device, therefore, should operate at the
same clock rate as the communication channel 156, and
generate no more than a specified amount of variation in
output clock phase relative to input clock phase. The master
device, however, can accept an arbitrary input clock phase
and tolerates a specified amount of variation in clock phase
over operating conditions.

Packets of information sent over the communication 
channel 156 preferably contain control commands, such as 
read or write operations, along with addresses and associated 
data. Other commands are provided to indicate error con- 
ditions and responses to the above commands. When the 
communication channel 156 is idle, such as during
initialization and between transmitted packets, an idle packet,
consisting of an all-zero byte and an all-one byte, is
transmitted through the communication channel 156. Each
non-idle packet consists of two bytes or a multiple of two
bytes, and begins with a byte having a value other than all
zeros. All packets transmitted over the communication
channel 156 also begin during a clock period in which the clock
signal is zero, and all packets preferably end during a clock
period in which the clock signal is one. A depiction of the
preferred packet protocol format for transmission over the
communication channel 156 appears in FIG. 17.

The general form of each packet is an array of bytes
preferably without a specific byte ordering. The first byte
contains a module address 250 ("ma") in the high order two
bits; a packet identifier, usually a command 252 ("com"), in
the next three bit positions; and a link identification number
254 ("lid") in the last three bit positions. The interpretation
of the remaining bytes of a packet depends upon the contents
of the packet identifier. The length of each packet is pref- 
erably implied by the command specified in the initial byte 
of the packet. A check byte is provided and computed as odd 
bit-wise parity with a leftward circular rotation after accu- 
mulating each byte. This technique provides detection of all 
single-bit and some multiple-bit errors, but no correction is 
provided. 
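
The first-byte layout described above (two bits of module address, three bits of command, three bits of link identification) can be sketched as bit packing. The Python below is a minimal illustration; the function names are hypothetical. The check-byte routine is only one plausible reading of "odd bit-wise parity with a leftward circular rotation after accumulating each byte": the zero starting value, the one-bit rotation amount, and the final inversion used to obtain odd parity are all assumptions, since the text does not specify them.

```python
def pack_header(ma: int, com: int, lid: int) -> int:
    """Pack module address (2 bits), command (3 bits) and link id (3 bits)."""
    if not (0 <= ma < 4 and 0 <= com < 8 and 0 <= lid < 8):
        raise ValueError("field out of range")
    return (ma << 6) | (com << 3) | lid


def unpack_header(byte: int) -> tuple:
    """Split a first byte back into (ma, com, lid)."""
    return (byte >> 6) & 0x3, (byte >> 3) & 0x7, byte & 0x7


def rotl8(value: int) -> int:
    """Rotate an 8-bit value left by one bit position."""
    return ((value << 1) | (value >> 7)) & 0xFF


def check_byte(payload: bytes) -> int:
    """Assumed check code: XOR each byte into an accumulator, rotate left
    after each byte, then invert the result to give odd parity."""
    accumulator = 0
    for byte in payload:
        accumulator = rotl8(accumulator ^ byte)
    return accumulator ^ 0xFF
```

Because each step is an XOR followed by a rotation, flipping any single bit of the payload flips exactly one bit of the result, which is consistent with the stated detection of all single-bit errors without any correction capability.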

The module address 250 field of each packet is preferably
a two-bit field and allows for as many as four slave
devices to be operated from a single communication channel
156. Module address values can be assigned in one of two
fashions: either dynamically assigned through a
configuration register (not shown), or assigned via static/geometric
configuration pins. Dynamic assignment through a configu- 
ration register is the presently preferred method for assign- 
ing module address values. 

The link identification number 254 field is preferably
3 bits wide and provides the opportunity for master devices
to initiate as many as eight independent operations at any 
one time to each slave device. Each outstanding operation 
requires a distinct link identification number, but no ordering 
of operations should be implied by the value of the link 
identification field. Thus, there is preferably no requirement 
for link identification values 254 to be sequentially assigned 
either in requests or responses. 
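
A master can track its outstanding operations per slave with a small pool of free link identification numbers. The Python sketch below is illustrative only: the `LinkIdPool` class and its methods are hypothetical names, and the use of an unordered set simply reflects the statement that no ordering is implied by link identification values.

```python
class LinkIdPool:
    """Free pool of link identification numbers for one slave device."""

    def __init__(self, size: int = 8) -> None:
        # A 3-bit field allows eight distinct values, 0 through 7.
        self._size = size
        self._free = set(range(size))

    def allocate(self):
        """Return any free link id, or None if all are outstanding."""
        return self._free.pop() if self._free else None

    def release(self, lid: int) -> None:
        """Return a link id to the pool once its operation completes."""
        if not 0 <= lid < self._size or lid in self._free:
            raise ValueError("link id is not outstanding")
        self._free.add(lid)
```

When `allocate` returns `None`, the master in this model must wait for a response before issuing a further request to that slave.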

The receipt of packets over the communication channel
156 that do not conform to the channel protocol preferably
generates an error condition. As those skilled in the art will
appreciate, the level or degree to which a specific
implementation detects errors is defined by the user. In one
presently preferred embodiment of the invention, all errors
are detected, and the following protocol is employed for
handling errors. For each error detected, the channel
interface device 216 causes a response explicitly indicating the
error condition. Channel interface devices 216 reporting an
invalid packet will then suppress the receipt of additional
packets until the error is cleared. The transmitted packet is
otherwise ignored. However, even though the erroneous
packet is ignored, the channel interface devices 216
preferably continue to process valid packets that have already been
received and generate responses thereto. The presently
preferred commands 252 to be used over the
communication channel 156 are listed in FIG. 17.

In the master/slave preferred embodiment, the channel 
interface devices 216 forward packets that are intended for 
other devices connected to the communication channel 156, 
as described above. In slave devices, forwarding is per- 
formed based on the module address 250 field of the packet. 
Packets which contain a module address 250 other than that 
of the current device are forwarded on to the next device. All 
non-idle packets are thus forwarded, including error packets.
In master devices, forwarding is performed based on the link
identifier number 254 of the packet. Packets that contain link
identifier numbers 254 not generated by the specific channel
interface device 216 are forwarded. In order to reduce
transmission latency, a packet buffer may be provided. As
those skilled in the art will appreciate, the suitable size for the
packet buffer depends on the amount of latency tolerable in 
a particular implementation. 
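
The two forwarding rules just described can be written as simple predicates. The Python sketch below is a minimal illustration with hypothetical function names; it models only the forwarding decision and omits the packet buffer, error packets, and timing.

```python
def slave_forwards(packet_ma: int, own_ma: int) -> bool:
    """A slave forwards any packet whose module address is not its own."""
    return packet_ma != own_ma


def master_forwards(packet_lid: int, own_outstanding_lids: set) -> bool:
    """A master forwards packets whose link id it did not generate."""
    return packet_lid not in own_outstanding_lids


def accepting_slave(ring_module_addresses: list, packet_ma: int):
    """Walk a cascade of slaves in order and return the position of the
    slave that accepts the packet, or None if no slave matches."""
    for position, own_ma in enumerate(ring_module_addresses):
        if not slave_forwards(packet_ma, own_ma):
            return position
    return None
```

In this model the returned position also equals the number of devices that forwarded the packet before it was accepted, which is the source of the longer response latency in a cascade.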

A variety of master/slave ring configurations are possible
using the high bandwidth interface 124 of the invention.
Five ring configurations are currently preferred: single-master,
dual-master, multiple-master, single-slave and
multiple-master/multiple-slave. The simplest ring
configuration contains a single non-forwarding master device and a
single non-forwarding slave device. No forwarding is
required for either device in this configuration as packets are
sent directly to the recipient. A single-master ring, however,
may contain a cascade of up to four slave devices (see FIGS.
13, 16). In the single-master ring configuration, each slave
device is configured to a distinct module address, and each
slave device forwards packets that contain module address
fields unequal to its own. As discussed above, a single-master
ring provides a larger memory or I/O capacity than
a master-slave pair, but also introduces a potentially longer
response latency. In the single-master ring, each slave device
may have as many as eight transactions outstanding at any
time, as described above.

The remaining combinations share many of the above
basic attributes. In a dual-master pair, each master device
may initiate read and write operations addressed to the other,
and each may have up to eight such transactions outstanding.
No forwarding is required for either device because packets
are sent directly to the recipient. A multiple-master ring may
contain multiple master devices and a single slave device. In
this configuration, the slave device need not forward packets
as all input packets are designated for the single slave
device. A multiple-master ring may contain multiple master
devices and as many as four slave devices. Each slave device
may have up to eight transactions outstanding, and each
master device may use some of those transactions. In a
preferred embodiment, a master also has the capability to
detect a time-out condition or when a response to a request
packet is not received. Further aspects of interprocessor
communications and configurations are discussed below in
connection with FIG. 18.
Serial Bus 

In one preferred embodiment of the invention, the general
purpose media processor 12 includes a serial bus (not
shown). The serial bus is designed to provide bootstrap
resources, configuration, and diagnostic support to the
general purpose media processor 12. The serial bus preferably
employs two signals, both at TTL levels, for direct
communication among many devices. In the preferred embodiment,
the first signal is a continuously running clock, and the
second signal is an open-collector bi-directional data signal.
Four additional signals provide geographic addresses for
each device coupled to the serial bus. A gateway protocol,
and optional configurable addressing, each provide a means
to extend the serial bus to other buses and devices. Although
the serial bus is designed for implementation in a system
having a general purpose media processor 12, as those
skilled in the art will appreciate, the serial bus is applicable
to other systems as well.

Because the serial bus is preferably used for the initial
bootstrap program load of the general purpose media
processor 12, the bootstrap ROM is coupled to the serial bus. As
a result, the serial bus needs to be operational for the first
instruction fetch. The serial bus protocol is therefore devised
so that no transactions are required for initial bus
configuration or bus address assignment.

According to the preferred embodiment, the clock signal
comprises a continuously running clock signal at a minimum
of 20 megahertz. The amount of skew, if any, in the clock
signal between any two serial bus devices should be limited
to be less than the skew on the data signal. Preferably, the
serial data signal is a non-inverted open collector
bi-directional data signal. TTL levels are preferred for
communication on the serial bus, and several termination
networks may be employed for the serial data signal. A simple
preferred termination network employs a resistive pull-up of
220 ohms to 3.3 volts above VSS. An alternate embodiment
employs a more complex termination network such as a
termination network including diodes or the "Forced Perfect
Termination" network proposed for the SCSI-2 standard,
which may be advantageous for larger configurations.

The geographic addressing employed in the serial bus is
provided to ensure that each device is addressable with a
number that is unique among all devices on the bus and
which also preferably reflects the physical location of the
device. Thus, the address of each device remains the same
each time the system is operated. In one preferred
embodiment, the geographic address is composed of four
bits, thus allowing for up to 16 devices. In order to extend
the geographic addressing to more than 16 devices,
additional signals may be employed such as a buffered copy of
the clock signal or an inverted copy of the clock signal (or
both).

The serial bus preferably incorporates both a bit level and
packet protocol. The bit level protocol allows any device to
transmit one bit of information on the bus, which is received
by all devices on the bus at the same time. Each transmitted
bit begins at the rising edge of the clock signal and ends at
the next rising edge. The transmitted bit value is sampled at
the next rising edge of the clock signal. According to one
preferred embodiment where the serial data signal is an open
collector signal, the transmission of a zero bit value on the
bus is achieved by driving the serial data signal to a logical
low value. In this embodiment, the transmission of a one bit
value is achieved by releasing the serial data signal to obtain
a logical high value. If more than one device attempts to
transmit a value on the same clock, the resulting value is a
zero if any device transmits a zero value, and one if all
devices transmit a one value. This provides a "wired-AND"
collision mechanism, as those skilled in the art will
appreciate. If two or more devices transmit the same value on the
same clock cycle, however, no device can detect the
occurrence of a collision. In such cases, the transaction, which
may occur frequently in some implementations, preferably
proceeds as described below.
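
The wired-AND behavior of the open collector data signal can be modeled in a few lines. The Python sketch below is illustrative; the function names are hypothetical, and an undriven bus is assumed to read as one because of the pull-up termination.

```python
def bus_value(transmitted_bits) -> int:
    """Resolve one clock cycle of the open collector bus.

    Each entry is 0 (device drives the line low) or 1 (device releases
    the line). The line reads 0 if any device drives it low.
    """
    return 0 if any(bit == 0 for bit in transmitted_bits) else 1


def collision_visible(own_bit: int, observed: int) -> bool:
    """A device can notice a collision only when it released the line
    (sent a one) but observes a zero."""
    return own_bit == 1 and observed == 0
```

Two devices sending the same value always observe exactly what they sent, which is why such a collision goes undetected in this model.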

The packet protocol employed with the serial bus uses the 
bit level protocol to transmit information in units of eight 
bits or multiples of eight bits. Each packet transmission 
preferably begins with a start bit in which the serial data 
signal has a zero (driven) value. After transmitting the eight 
data bits, a parity bit is transmitted. The transmission con- 
tinues with additional data. A single one (released) bit is 
transmitted immediately following the least significant bit of 
each byte signaling the end of the byte. 

On the cycle following the transmission of the parity bit, 
any device may demand a delay of two cycles to process the 
data received. The two-cycle delay is initiated by driving the
serial data signal (to a zero value) and releasing the serial
data signal on the next cycle. Before releasing the serial data
signal, however, it is preferable to ensure that the signal is
not being driven by any other device. Further delays are 
available by repeating this pattern. 

In order to avoid collisions, a device is not permitted to 
start a transmission over the serial bus unless there are no 
currently executing transactions. To resolve collisions that 
may occur if two devices begin transmission on the same 
cycle, each transmitting device should preferably monitor 
the bus during the transmission of one (released) bits. If any 
of the bits of the byte are received as zero when transmitting
a one, the device has lost arbitration and must cease trans- 
mission of any additional bits of the current byte or trans- 
action. 
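
The arbitration rule above can be simulated bit by bit. The Python sketch below is a simplified model under stated assumptions: all contending devices start on the same cycle, their bit sequences have equal length, and a device that loses stops driving the bus immediately. The `arbitrate` function and its data layout are hypothetical.

```python
def arbitrate(messages: dict) -> tuple:
    """Simulate devices that start transmitting on the same cycle.

    `messages` maps a device name to its list of bits (equal lengths).
    Returns (bits seen on the bus, set of devices still transmitting).
    """
    lengths = {len(bits) for bits in messages.values()}
    if len(lengths) > 1:
        raise ValueError("all messages must have the same length")
    active = set(messages)
    bus_bits = []
    for i in range(lengths.pop() if lengths else 0):
        # Wired-AND: the bus is zero if any active device drives zero.
        observed = min(messages[device][i] for device in active)
        bus_bits.append(observed)
        # A device that released the line but sees zero has lost.
        active = {d for d in active
                  if not (messages[d][i] == 1 and observed == 0)}
    return bus_bits, active
```

For example, with device A sending `[0, 1, 0, 1]` and device B sending `[0, 1, 1, 0]`, B loses at the third bit and the bus carries A's bits unchanged, so the winning transmission is not corrupted.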

According to the preferred embodiment of the invention, 
a serial bus transaction consists of the transmission of a 
series of packets. The transaction begins with a transmission 
by the transaction initiator, which specifies the target 
network, device, length, type and payload of the transaction 
request. The transaction terminates with a packet having a 
type field in a specified range. As a result, all devices 
connected to the serial bus should monitor the serial data 
signal to determine when transactions begin and end. A 
serial bus network may have multiple simultaneous trans- 
actions occurring, however, so long as the target and initiator 
network addresses are all disjoint. 
Parallel Processing 

In one preferred embodiment of the invention, two or
more general purpose media processors 12 can be linked
together to achieve a multiple processor system. According
to this embodiment, general purpose media processors 12
are linked together using their high bandwidth interface
channels 124, either directly or through external switching
components (not shown). The dual-master pair configuration
described above can thus be extended for use in
multiple-master ring configurations. Preferably, internal daemons
provide for the generation of memory references to remote
processors, accesses to local physical memory space, and the
transport of remote references to other remote processors. In
a multi-processor environment, all general purpose media
processors 12 run off of a common clock frequency, as
required by the communication channels 156 that connect
between processors.

Referring to FIG. 18, each general purpose media
processor 12 preferably includes at least a pair of
inter-processor links 218 (see also FIG. 16(b)). In one
configuration, both pairs of inter-processor links 218 can be
connected between the two processors 12 to further enhance
bandwidth. As shown in FIG. 18(a), several processors 12
may be interconnected in a linear network employing the
transponder daemons in each processor. In an alternate
embodiment shown in FIG. 18(b), the inter-processor links
218 may be used to join the general purpose media
processors 12 in a ring configuration. Alternatively still, general
purpose media processors 12 may be interconnected into a
two-dimensional network of processors of arbitrary size, as
shown in FIG. 18(c). Sixteen processors are connected in
FIG. 18(c) by connecting four ring networks. In yet another
alternate embodiment, by connecting the inter-processor
links 218 to external switching devices (not shown),
multiprocessors with a large number of processors can be
constructed with an arbitrary interconnection topology.

The requester, responder and transponder daemons
preferably handle all inter-processor operations. When one
general purpose media processor 12 attempts a load or store to
a physical address of a remote processor, the requester
daemon autonomously attempts to satisfy the remote
memory reference by communicating with the external
device. The external device may comprise another processor
12 or a switching device (not shown) that eventually reaches
another processor 12. Preferably, two requester daemons are
provided in each processor 12, which act concurrently on two
different byte channels and/or module addresses. The
responder daemon accepts writes from a specified channel
and module address, which enables an external device to
generate transaction requests in local memory or to generate
processor events. The responder daemon also generates link
level writes to the same external device that communicated
responses for the received transaction request. Two such
responder daemons are preferably provided, each of which
operates concurrently on two different byte channels and/or
module addresses.

The transponder daemon accepts writes from a specified
channel and module address, which enable an external
device to cause a requester daemon to generate a request on
another channel and module address. Preferably, two such
transponder daemons are provided, each of which acts
concurrently (back-to-back) between two different byte channels
and/or module addresses. As those skilled in the art will
appreciate, the requester, responder and transponder
daemons must act cooperatively to avoid deadlock that may
arise due to an imbalance of requests in the system.
Deadlocks prevent responses from being routed to their
destinations, which may defeat the benefits of a
multi-processor distributed system.

According to one presently preferred embodiment of the
invention, the general purpose media processor 12 can be
implemented as one or more integrated circuit chips.
Referring to FIG. 19, the presently preferred embodiment of the
general purpose media processor 12 consists of a four-chip
set. In the four-chip set, a general purpose media processor
12 is manufactured as a stand alone integrated circuit. The
stand alone integrated circuit includes a memory
management unit 122, instruction and data cache/buffers 118, 120,
and an execution unit 100. A plurality of signal input/output
pads 260 are provided around the circumference of the
integrated circuit to communicate signals to and from the
general purpose media processor 12 in a manner generally
known in the art.

The second and third chips of the four-chip set comprise
an external memory element 158 and a channel interface
device 216. The external memory element 158 includes an
interface to the communication channel 156, a cache 150
and a memory interface 152. The channel interface device
216 also includes an interface to the communication channel
156, as well as buffer memory 262, and input/output
interfaces 264. Both the external memory element 158 and the
channel interface device 216 include a plurality of input/output
signal pads 260 to communicate signals to and from
these devices in a generally known manner.

The fourth integrated circuit chip comprises a switch 226,
which allows for installation of the general purpose media
processor 12 in the heterogeneous network 38. In addition to
the plurality of input/output pads 260, the switch 226
includes an interface to the communication channel 156. The
switch 226 also preferably includes a buffer 262, a router
266, and a switch interface 268.

As those skilled in the art will appreciate, many implementations for
the general purpose media processor 12 are possible in addition to the
four-chip implementation described above. Rather than an integrated
approach, the general purpose media processor can be implemented in a
discrete manner. Alternatively, the general purpose media processor 12
can be implemented in a single integrated circuit, or in an
implementation with fewer than four integrated circuit chips. Other
combinations and permutations of these implementations are
contemplated.

There has been described a system for processing streams of media data
at substantially peak rates to allow for real time communication over a
large heterogeneous network. The system includes a media processor at
its core that is capable of processing such media data streams. The
heterogeneous network consists of, for example, the fiber optic/coaxial
cable/twisted wire network in place throughout the U.S. To provide for
such communication of media data, a media processor according to the
invention is disposed at various locations throughout the heterogeneous
network. The media processor would thus function both in a server
capacity and at an end user site within the network. Examples of such
end user sites include televisions, set-top converter boxes, facsimile
machines, wireless and cellular telephones, as well as large and small
business and industrial applications.

To achieve such high rates of data throughput, the media processor
includes an execution unit, high bandwidth interface, memory management
unit, and pipelined instruction and data paths. The high bandwidth
interface includes a mechanism for transmitting media data streams to
and from the media processor at rates at or above the gigahertz
frequency range. The media data stream can consist of transmission,
presentation and storage type data transmitted alone or in a unified
manner. Examples of such data types include audio, video, radio,
network and digital communications. According to the invention, the
media processor is dynamically partitionable to process any combination
or permutation of these data types in any size.

A programmable, general purpose media processor system presents
significant advantages over current multimedia communications. Rather
than rigid, costly and inefficient specialized processors, the media
processor provides a general purpose instruction set to ease
programmability in a single device that is capable of performing all of
the operations of the specialized processor combination. Providing a
uniform instruction set for all media related operations eliminates the
need for a programmer to learn several different instruction sets, each
for a different specialized processor. The complexity of programming
the specialized processors to work together and communicate with one
another is also greatly reduced. The unified instruction set is also
more efficient. Highly specialized general calculation instructions
that are tailored to general or special types of calculations rather
than enhancing communication are eliminated.

Moreover, the media processor system can be easily reprogrammed simply
by transmitting or downloading new software over the network. In the
specialized processor approach, new programming usually requires the
delivery and installation of new hardware. Reprogramming the media
processor can be done electronically, which of course is quicker and
less costly than the replacement of hardware.

It is to be understood that a wide range of changes and 
modifications to the embodiments described above will be 
apparent to those skilled in the art and are contemplated. It 
is therefore intended that the foregoing detailed description 
be regarded as illustrative rather than limiting, and that it be 
understood that it is the following claims, including all 
equivalents, that are intended to define the spirit and scope 
of this invention. 

We claim: 

1. A system for unified media processing comprising: 

a plurality of general purpose media processors, each
media processor being operable at sustained peak data
rates and having a dynamically partitioned execution
unit, wherein a plurality of media data streams are
concurrently transmitted over a single data path and are
dynamically partitioned according to an elemental
symbol width that is equal to or narrower than the data
path, and having a high bandwidth interface, the high
bandwidth interface coupled to external memory and
input/output elements to receive, and transmit data to
the media processor at substantially peak rates; and

a bi-directional communication fabric, the plurality of
media processors coupled to the bi-directional commu-
nication fabric to transmit and receive at least one
media stream comprising presentation, transmission,
and storage media information; and

wherein each media processor further comprises dedi- 
cated memory and wherein the each of the plurality of 
media processors can employ any unused portion of the 
dedicated memory of another media processor in a 
shared manner to efficiently store and retrieve 
presentation, transmission and storage media informa- 
tion at substantially peak data rates. 

2. The system defined in claim 1, wherein the 
bi-directional communication fabric comprises a fiber optic 
network. 

3. The system defined in claim 1, wherein the 
bi-directional communication fabric comprises an heteroge- 
neous network. 

4. The system defined in claim 1, wherein the
bi-directional communication fabric comprises a coaxial
cable network.

5. The system defined in claim 1, wherein the 
bi-directional communication fabric comprises a wireless 
network. 

6. The system defined in claim 1, wherein a subset of the
plurality of media processors comprise network servers.

7. The system defined in claim 1, wherein the plurality of
media processors are programmable by downloading pro-
gram information over the bi-directional communication
fabric.

8. The system defined in claim 1, wherein the plurality of
media processors can access an idle execution unit of
another media processor in a shared manner to efficiently
process presentation, transmission and storage media infor-
mation at substantially peak data rates.

***** 

ABSTRACT

improving the performance of general purpose processors by
expanding at least one source operand to a width greater than
the width of either the general purpose register or the data
path width. In addition, the present invention provides
several classes of instructions which cannot be performed
efficiently if the operands are limited to the width and
accessible number of general purpose registers. The present
invention provides operands which are substantially larger
than the data path width of the processor by using a general
purpose register to specify a memory address from which at
least more than one, but typically several data path widths of
data can be read. The present invention also provides for the
efficient usage of a multiplier array that is fully used for high
precision arithmetic, but is only partly used for other, lower
precision operations.

def c ← PolyMultiply(size,a,b) as
    p[0] ← 0^{2*size}
    for k ← 0 to size-1
        p[k+1] ← p[k] ^ a_{k} ? (0^{size-k} || b || 0^{k}) : 0^{2*size}
    endfor
    c ← p[size]
enddef
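
The PolyMultiply definition above is a carry-less (GF(2)) polynomial multiplication: each set bit k of a contributes a copy of b shifted left by k, and the copies are combined with exclusive-or rather than addition. A minimal Python sketch follows, assuming operands are held as non-negative integers whose bit k is the coefficient of x^k. The name poly_multiply is illustrative and does not come from the patent.

```python
def poly_multiply(size: int, a: int, b: int) -> int:
    """Carry-less product of two size-bit polynomials over GF(2).

    Hypothetical helper mirroring the PolyMultiply pseudocode:
    the result has up to 2*size bits.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    mask = (1 << size) - 1
    a &= mask
    b &= mask
    p = 0  # corresponds to p[0] = 0^(2*size)
    for k in range(size):
        if (a >> k) & 1:
            # XOR in b shifted left by k, i.e. (0^(size-k) || b || 0^k)
            p ^= b << k
    return p
```

For example, poly_multiply(4, 0b11, 0b11) squares the polynomial x + 1 and yields x^2 + 1 (0b101), because the two middle terms cancel under exclusive-or.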

def WideMultiplyMatrix(major,op,gsize,rd,rc,rb)
    d ← RegRead(rd, 128)
    c ← RegRead(rc, 64)
    b ← RegRead(rb, 128)
    lgsize ← log(gsize)
if <Vi M -4..o * 0 then 

raise AccessDisallowedByVirtualAddress 

endif 

if c 2..igdz8-3 * 0 Iten 

wsize^-(c and (Q-c))|| 0 4 
t-+~c and (c-1) 

else 

w$ize-*~64 

endif 

lwsize-+~log(wsize) 

W t|wsize^-lg$fe*Jwsizo-3 * 0 then 
msize^tand (0-t))|| 0 4 
VirtAddr-^tand (t-1) 

else 

msfre 28*wsize/gsize 
VirtAddr-*-t 

endif 

case major of
    W.MINOR.B:
        order ← B
    W.MINOR.L:
        order ← L
endcase

FIG. 14D-1 
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case op of 

M.MUL.MAT.U.8, W.MULMAT.U.18, W.MULMAT.U.32. 
W.MUL.MAT.U-64: 

ms-<- bs-«-0 
W.MUL.MAT.M.8, W.MULMAT.M.16, W.MULMAT.M.32, 
W.MULMAT.M.64 

m$-+-Q 

bs-*-1 

W.MUL.MAT.8, W.MUL.MAT.16. W.MULMAT.32, 
W.MUL.MAT.64, W.MUL.MAT.C.8, W.MUL.MAT.C.16, 
W.MUL.MAT.C.32. W:MUL.MAT.C.64: 
ms-*-bs-*~1 

W.MULMAT.P.8, W.MUI.MAT.P.16, W.MUL.MAT.P.32, 
W.MUL.MAT.P.64: 
endcase 

m LoadMemory{c,VirtAddr,msize,order) 
h -*-2*gsize 



for i -*-Q to wsize-gsize by gsize 

for j-*-0 to vsize-gsize by gsize 
case op of 

W.MULMAT.P.8, W.MUL.MAT.P.16, 
W.MUL.MAT.P.32, W.MUL.MAT.P.64: 
k -*- i+wsae'jejflsjzB 

qfj+gsize] qlj) A PolyMultiply(gsize,m k ^ si2 e.i.jc 

bj*flaze-1..j) 

W.MUL.MAT.C.8, W.MUL.MAT.C.16. W.MUL.MAT.C.32, 

W.MULMAT.C64: 

if (~i) 4 gsize - 0 then 

k-*-i-(j&gsize)+wsize"j 8 .,ip $ j 2e «. 1 
qfj+gsizeH- q[i] * mul(gsize,h,ms,m,k,b$,b,|) 

else 

k "»+gsize+wsize*j8..igei2e*i 
q[H^sizeJ-^-q[il = mu!(gsize,h,ms,m f k,bs,b,j) 

endif 
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r 



1480 



W.MULMAT.8, W.MULMAT.16. W.MULMAT.32. 
W.MULMAT.64, W.MUL.MAT.M.8, W.MUL.MAT.M.16, 
W.MULMAT.M.32, W.MULMAT.M.64, W.MUL.MAT.U.8, 
W.MULMAT.U.16, W.MULMAT.U.32. W.MUL.MAT.U.64 
q[i-Hjsize] -*-qlj) mul(gsize,h,ms,m,i^wsize* 

is.Jgsize.bs.b,)) 

endfor 

32*flSizo.1*2'L.2 , i -*~<l(V5ize| 

endfor 

RegWrite(rd, 128. a) 
enddef 



FIG. 14D-3 
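
Stripped of partitioning, signedness and the complex and polynomial variants, the WideMultiplyMatrix loops above accumulate, for each result position i, the products of vector element j with matrix element (j, i). The sketch below illustrates only that accumulation pattern. It assumes the operands have already been unpacked into Python integers, and the function name and list-based layout are hypothetical, not the patent's bit-level register format.

```python
from typing import List, Sequence


def wide_multiply_matrix(matrix_rows: Sequence[Sequence[int]],
                         vector: Sequence[int]) -> List[int]:
    """Return result[i] = sum over j of vector[j] * matrix_rows[j][i].

    matrix_rows[j] stands in for the slice of the wide memory operand
    that is multiplied by vector element j (an assumed layout).
    """
    if len(matrix_rows) != len(vector):
        raise ValueError("one matrix row is needed per vector element")
    width = len(matrix_rows[0]) if matrix_rows else 0
    q = [0] * width  # running sums, like q[] in the pseudocode
    for j, b_j in enumerate(vector):
        row = matrix_rows[j]
        if len(row) != width:
            raise ValueError("all matrix rows must have the same length")
        for i in range(width):
            q[i] += row[i] * b_j
    return q
```

With matrix rows [[1, 2], [3, 4]] and vector [1, 2], the result is [7, 10]. Python integers do not overflow, which loosely mirrors the double-width products the instruction accumulates.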
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Global TB miss 
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1510 



Operation codes 



W.MUL.MAT.X.B    Wide multiply matrix extract big-endian
W.MUL.MAT.X.L    Wide multiply matrix extract little-endian



Selection 



class 


op 


order 


Multiply matrix extract 


W.MULMAT.X 


B L 



Format 

W.op.order ra=rc,rd,rb
ra=woporder(rc,rd,rb)

 31        24 23    18 17    12 11     6 5      0
| W.op.order |   rd   |   rc   |   rb   |   ra   |
      8          6        6        6        6



FIG. 15A 
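
The format of FIG. 15A packs an 8-bit operation code and four 6-bit register numbers into a 32-bit word. A short Python sketch of unpacking those fields is shown here. It assumes the bit positions read from the figure (31..24, 23..18, 17..12, 11..6, 5..0), and the type and function names are illustrative only.

```python
from typing import NamedTuple


class WideInstruction(NamedTuple):
    """Fields of a 32-bit W.op.order instruction word (assumed layout)."""
    op: int  # bits 31..24, 8 bits
    rd: int  # bits 23..18, 6 bits
    rc: int  # bits 17..12, 6 bits
    rb: int  # bits 11..6,  6 bits
    ra: int  # bits 5..0,   6 bits


def decode_wide_instruction(word: int) -> WideInstruction:
    """Split a 32-bit instruction word into its five fields."""
    if not 0 <= word < (1 << 32):
        raise ValueError("instruction word must fit in 32 bits")
    return WideInstruction(
        op=(word >> 24) & 0xFF,
        rd=(word >> 18) & 0x3F,
        rc=(word >> 12) & 0x3F,
        rb=(word >> 6) & 0x3F,
        ra=word & 0x3F,
    )
```

The same shifts and masks apply to the other four-register formats in these figures, since they share the 8/6/6/6/6 field widths.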

 31    24 23    16 15 14 13 12 11 10  9 8       0
| fsize  |  dpos  | x| s| n| m| l| rnd |  gssp   |
    8        8      1  1  1  1  1   2       9



FIG. 15B 
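
FIG. 15B describes the low 32 bits of the control register that steers the extract step. The sketch below unpacks those fields in Python. The bit positions are taken from the figure and from the pseudocode's references to bits 14 (signed), 10..9 (rnd) and 8..0 (gssp); the dictionary keys and the function name are assumptions made for illustration.

```python
from typing import Dict


def decode_extract_control(b: int) -> Dict[str, int]:
    """Unpack the low 32 bits of the extract control register.

    Assumed field layout (from FIG. 15B): fsize 31..24, dpos 23..16,
    x 15, s 14, n 13, m 12, l 11, rnd 10..9, gssp 8..0.
    """
    if b < 0:
        raise ValueError("register value must be non-negative")
    return {
        "fsize": (b >> 24) & 0xFF,
        "dpos": (b >> 16) & 0xFF,
        "x": (b >> 15) & 1,
        "s": (b >> 14) & 1,
        "n": (b >> 13) & 1,
        "m": (b >> 12) & 1,
        "l": (b >> 11) & 1,
        "rnd": (b >> 9) & 0b11,
        "gssp": b & 0x1FF,
    }
```

Bits above position 31 of the 128-bit register are ignored here, since the figure only shows the low word.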

[Diagram 1530: memory operand m[rc] (128*128/size bits), rd(128) and rb(32) feeding multiply and extract units; result ra(128).]

Wide multiply matrix extract doublets



FIG. 15C 
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^-1560 



511 rc(64»128/sizel 




128 ra(128) 0 

Wide multiply matrix extract complex doublets 
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Definition ^-1580 
def mul(size l h,vs,v l i t ws.w.j) as * 

muK- {{VS&Vdze.i^jh -si2e|jv s j ze .i^.i) * ((ws&w s i2e.l*j)^2o|lWsi ze . HB( j) 
enddef 



def WideMultiplyMatrixExtract(op,ra,rb,rc,rd)
    d ← RegRead(rd, 128)
    c ← RegRead(rc, 64)
    b ← RegRead(rb, 128)
    case b_{8..0} of
        0..255:
            sgsize ← 128
        256..383:
            sgsize ← 64
        384..447:
            sgsize ← 32
        448..479:
            sgsize ← 16
        480..495:
            sgsize ← 8
        496..503:
            sgsize ← 4
        504..507:
            sgsize ← 2
        508..511:
            sgsize ← 1
    endcase

signed-*-bi4 
if c 3 .0 * 0 then 

wsize-Hc and (0-c))|| 0 4 

t-*-cand (c-1) 

else 

wsize-*-128 
endif 

if sgstze < 8 then 
gsize-*-8 

elseif sgsize > wsize/2 then 
gsize-«~wsize/2 

else 

FIG. 15E-1 
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gsize-*-sgsize 
endif 

lgsize-*-log(gsize 
Iwsize^log(wsize) 

* f l lwsiie*6-r\-lgsize.Jwstze-3 * 0 then 

mstze-«-(t and (<M)) || 0* 
VirtAddr^-tand (M) 

else 

msize 64*(2-n)*wsize/gsize 
VirtAddr^-t 
endif 

vsize ^-{1*n)*msize*gsize/wsize 

mm LoadMemorytc.VirlAddr.msize.order) 

lmsize-#-log(msize) 

if (VirtAddr| msiw . 4 0 * 0 then 

raise AccessDisallowedByVirtualAddress 
endif 

case op of
    W.MUL.MAT.X.B:
        order ← B
    W.MUL.MAT.X.L:
        order ← L
endcase

ms -4- signed 

ds-*- signed A m 

as-*-signed or m 

spos (be..o) and (2*gsize-1) 

dpos^-(0||i>23^6) and (gsize-1) 

r-*-spos 

sfsize-+-(0|| b 31 24) and (gsize-1) 

tfsize -«-($fsize = 0) or {(sfsize+dpos) > gsize) ? gsize-dpos : sfsize 
fsize (tfsize ♦ spos > h) ? h - spos : tfsize 
if (bio 9 = Z) & -signed then 
rnd-*-F 

else 

rnd-*-bio. 9 
endif 



FIG. 15E-2 
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(r .1580 

for i -*-0 to ws»2e-gsize by gsize 
q|0] 0 2 *9»»*7-lgs«o 

for 0 to vsize-gsize by gsize 
if n then 

if (-) & j & gsize = 0 then 

k-*- i-G49s»ze)+wsize*j 8 , 98ize+1 
q[i+gsize)-«- q(i] + mulCgsize.h.ms.mm.k.ds.d.j} 

else 

k i+gsize+wsize'ja .i ge j ?0+1 
q[i+gsize|-«-q[i] - muKgsize.h.ms.mm.k.ds.d.j) 

endif 

else 

q[i+Ssize)-«-qli] = mul{gsize,h.ms,mm.kj*wsize/gsize,ds,d,j) 

endif 
endfor 
P -*-qI128] 
case rnd of 

none. N: 

s^-0^||-p r ||pf-i 

Z: 

s — 0»w||pr 



s-*-0 h 



end case 

v-t-((ds&ph-l}||p)*(0||s) 

« (vh..r*fsiz9» (as & v t ^z^) h * u ' Mn ) or not I then 

w -*-(aS & VMsize.l)« 8iM - fsii4 - dpo8 ||Vfsi 2 e.W..rllO d P 08 

else 

W ^-.( S ? (v h || - v ^M-<Jpos.1 j . ^slze^posjUQdpos 

endif 
endfor 

3l27..wsi26^° 

RegWrite(ra, 128, a) 
enddef 



FIG. 15E-3 
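
The "case rnd of" step in the definition above selects a bias s that is added to the accumulated sum before the low r bits are discarded. The Python sketch below illustrates that bias-then-truncate idea for four rounding modes. It assumes that N means nearest with ties to even, Z toward zero, F floor and C ceiling; that reading of the mode letters, and the function name, are assumptions rather than statements from the figure, and the sketch is not claimed to be bit-exact with the pseudocode.

```python
def round_shift(p: int, r: int, mode: str = "N") -> int:
    """Arithmetic right shift of p by r bits after adding a rounding bias.

    mode: "N" nearest (ties to even), "Z" toward zero,
          "F" floor, "C" ceiling. All four are assumed meanings.
    """
    if r < 0:
        raise ValueError("shift amount must be non-negative")
    if r == 0:
        return p
    all_ones = (1 << r) - 1
    if mode == "F":
        s = 0                       # no bias: plain floor
    elif mode == "C":
        s = all_ones                # any discarded bit rounds up
    elif mode == "Z":
        s = all_ones if p < 0 else 0  # only negative values round up
    elif mode == "N":
        lsb = (p >> r) & 1          # lowest bit that survives the shift
        s = (1 << (r - 1)) - 1 + lsb  # a tie moves toward the even result
    else:
        raise ValueError("unknown rounding mode: " + mode)
    return (p + s) >> r
```

Python's >> on integers is an arithmetic shift, so negative sums behave like sign-extended hardware values. For p = 6 and r = 2 (the value 1.5), the modes N, Z, F and C give 2, 1, 1 and 2.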
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j^-1610 

Operation codes 



W.MUL.MAT.X.I.8.B       Wide multiply matrix extract immediate signed byte big-endian
W.MUL.MAT.X.I.8.L       Wide multiply matrix extract immediate signed byte little-endian
W.MUL.MAT.X.I.16.B      Wide multiply matrix extract immediate signed doublet big-endian
W.MUL.MAT.X.I.16.L      Wide multiply matrix extract immediate signed doublet little-endian
W.MUL.MAT.X.I.32.B      Wide multiply matrix extract immediate signed quadlet big-endian
W.MUL.MAT.X.I.32.L      Wide multiply matrix extract immediate signed quadlet little-endian
W.MUL.MAT.X.I.64.B      Wide multiply matrix extract immediate signed octlets big-endian
W.MUL.MAT.X.I.64.L      Wide multiply matrix extract immediate signed octlets little-endian
W.MUL.MAT.X.I.C.8.B     Wide multiply matrix extract immediate complex bytes big-endian
W.MUL.MAT.X.I.C.8.L     Wide multiply matrix extract immediate complex bytes little-endian
W.MUL.MAT.X.I.C.16.B    Wide multiply matrix extract immediate complex doublets big-endian
W.MUL.MAT.X.I.C.16.L    Wide multiply matrix extract immediate complex doublets little-endian
W.MUL.MAT.X.I.C.32.B    Wide multiply matrix extract immediate complex quadlets big-endian
W.MUL.MAT.X.I.C.32.L    Wide multiply matrix extract immediate complex quadlets little-endian



Selection 



class 


op 


type 


size 


order 


wide multiply 
extract immediate 


W.MUL.MAT.X.t 


NONE 


8 16 32 64 


LB 


C 


8 16 32 


LB 



Format 

W.op.tsize.order rd=rc,rb ( i 

rd=woptsizeorder(rc,rb,i) 

31 24 23 18 17 12 11 6 5 4 32 0 

l_ W.op.order I rd I rc I rb |t | sz fsbl 
8 6 6 6 1 2 3 

sz-*- tog(size) - 3 
assert size*3 2 i > size-4 
sh i - size 



FIG. 16A 
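
The last lines of FIG. 16A describe how the element size and the extraction shift are squeezed into the 2-bit sz field and the 3-bit sh field. The sketch below shows one way to encode and decode them in Python. It reads the garbled assertion as size-4 <= i <= size+3 and treats sh as a 3-bit two's-complement offset from size; both readings, and the function names, are assumptions.

```python
from typing import Tuple


def encode_immediate_shift(size: int, i: int) -> Tuple[int, int]:
    """Return (sz, sh) for an element size and shift amount i.

    sz = log2(size) - 3, sh = i - size stored in 3 bits (assumed
    two's complement, so the offset ranges over -4..3).
    """
    if size not in (8, 16, 32, 64):
        raise ValueError("size must be 8, 16, 32 or 64")
    if not size - 4 <= i <= size + 3:
        raise ValueError("shift must lie within size-4 .. size+3")
    sz = size.bit_length() - 4  # log2(size) - 3 for a power of two
    sh = (i - size) & 0b111
    return sz, sh


def decode_immediate_shift(sz: int, sh: int) -> Tuple[int, int]:
    """Inverse of encode_immediate_shift: return (size, i)."""
    if not 0 <= sz <= 3 or not 0 <= sh <= 7:
        raise ValueError("sz is 2 bits and sh is 3 bits")
    size = 8 << sz
    offset = sh - 8 if sh & 0b100 else sh  # sign-extend the 3-bit field
    return size, size + offset
```

For doublets with a shift of 12, encode_immediate_shift(16, 12) gives (1, 4), and decoding (1, 4) returns (16, 12).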
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r 



1630 



1023 m[i 



\extrac 



cl(128M28/size) 



.extract / | ^attract/' 



, Vxtracl/ , , 



♦ 1 



127 



rd{128) 



128 rd(1 28) 0 

Wide multiply matrix extract immediate doublets 
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r 



1660 



511 rc(64«128/size) 




rb(128) 



128 rd(128) 0 

Wide multiply matrix extract immediate complex doublets 
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_ -1680 

Definition 
def rnul{$ize,h,v$,v,i,ws,wj) as 

mul -^-({vs&vsiw -i+0fc-size|| v*e-W..l) * ({ws&Wsjze^)^**!! VWKJ) 
enddef 



def WideMuttiplyMatrixExtractimmediatetop.type^size.rd.fc.rb.sh) 
c-*-RegRead(rc, $4) 
b-*-RegRead{rb, 128} 
lgsize-*-log{gsize) 
case type of 
NONE: 

j! C| g siza-4..o * 0 then 

raise AccessDisallowedBy VirtualAddress 
endif 

i^ugsize^^Othen 

wsize-«-(c and (0-c})|| (r 
t-#-c and (c-1) 

else 

wsize-«-128 
t-*-c 
endif 

Iwsize -*Hog(wsize) 

if t|wsl28*e.|gaze..lwsii8-3 * 0 then 
msize -«-(t and (0-t))||0 4 
VirtAddr-#-tand(M) 

else 

msize 1 28*wsize/gsize 
VirtAddr^t 

C: 

if *lgsize-4..o * 0 then 

raise AccessDisallowedByVirtualAddress 

endif 

if ca.jgsize-3 * 0 then 

wsize -*-(c and (0-c))||<r 
t«*-c and (c-1) 

else 

wsize-*-128 

endif 

Iwsize-*- log(wsize) 

if ttoaze+5-lgst2e..lw$i2e-3 * 0 ^en 

msize -*-(t and (0-t))|| 0 4 

FIG. 16D-1 
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VirtAddr-*- t and (t-1) ^—1680 

else 

msize -*-64*wsize/gsize 
VirtAddr-«-t 

endif 

vsize 2*m$ize*gsize/wsize 

endcase 
case of of 

W.MULMATXI.B: 

orders B 
W.MULMAT.X.I.L: 
order-*- L 

endcase 

as-*-ms-+-bs-*-1 

m^LoadMemoryfc.VirtAddr.msize.order) 
h <2*gsize) «• 7 - lgsize-(ms and bs) 
r -*-gsize*(sh|||sh) 
for-*-0 to wstze-gsize by gsize 
q[0J-*- 02 # geize+7-lgsize 

for j-*- 0 to vsize-gsize by gsize 
case type of 
NONE: 

q[H>sizeJ ^q{i] + mul(gsize,h,ms,m,i+wsize* 
is.Jgsue.bs.bj) 

C: 

if (-i) & j & gsize = Othen 

k ^O&gsize^wsizelLig^i 
qO*gsizel^qtn + mul(g$ize,h,m$,m.k,bs,b,j) 

else 

k^i+gsize+wsize^jn^i 
qfl+gsizel^qffl - mul(gsize fc h l ms > m,k,bs ) b > j) 
endif 

endcase 
endfor 

p~+-q(vsize] 

s-*-(M| -P r ll PE : * 
v^((as4p M )||p)+(0j|s) 

.r^gstze = (as & v r+ gsj^^)h*i-r-g«za then 

else 

ag*w-M..h«- as ? (v h |hvP M l ) ; Igsize 

endif 
endfor 

»127..wsize 0 
RegWrit»(rtf B 128.a) 
enddef FIG. 16D-2 
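
The definitions above recover operand sizes from the address register itself: "c and (0-c)" isolates the lowest set bit of the specifier, appending four zero bits turns it into a size in bits, and "c and (c-1)" clears that bit to leave the rest. A minimal Python sketch of this trick follows. It assumes a specifier formed as address + msize/2 + wsize/2 (sizes in bytes) and omits the pseudocode's fallback to default sizes when the size bits are absent; the function names are illustrative.

```python
from typing import Tuple


def split_lowest_bit(spec: int) -> Tuple[int, int]:
    """Return (size_in_bits, remainder) from a size-tagged specifier."""
    if spec <= 0:
        raise ValueError("specifier must contain at least one set bit")
    lowest = spec & -spec          # c and (0-c): isolate lowest set bit
    size_bits = lowest << 4        # || 0^4: append four zero bits
    remainder = spec & (spec - 1)  # c and (c-1): clear that bit
    return size_bits, remainder


def decode_wide_specifier(c: int) -> Tuple[int, int, int]:
    """Return (wsize_bits, msize_bits, virt_addr) for a two-level specifier."""
    wsize, t = split_lowest_bit(c)
    msize, virt_addr = split_lowest_bit(t)
    return wsize, msize, virt_addr
```

As a worked case, an address of 0x1000 with a 256-byte operand and a 16-byte width gives the specifier 0x1000 + 128 + 8 = 0x1088, which decodes to a width of 128 bits, an operand size of 2048 bits and the address 0x1000. The scheme relies on the address being aligned to the operand size, so the size bits cannot collide with address bits.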
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Exceptions 

Access disallowed by virtual address 
Access disallowed by tag 
Access disallowed by global TB
Access disallowed by local TB 
Access detail required by tag 
Access detail required by local TB 
Access detail required by global TB 
Local TB miss 
Global TB miss 
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^1710 



Operation codes

W.MUL.MAT.C.F.16.B    Wide multiply matrix complex floating-point half big-endian
W.MUL.MAT.C.F.16.L    Wide multiply matrix complex floating-point half little-endian
W.MUL.MAT.C.F.32.B    Wide multiply matrix complex floating-point single big-endian
W.MUL.MAT.C.F.32.L    Wide multiply matrix complex floating-point single little-endian
W.MUL.MAT.F.16.B      Wide multiply matrix floating-point half big-endian
W.MUL.MAT.F.16.L      Wide multiply matrix floating-point half little-endian
W.MUL.MAT.F.32.B      Wide multiply matrix floating-point single big-endian
W.MUL.MAT.F.32.L      Wide multiply matrix floating-point single little-endian
W.MUL.MAT.F.64.B      Wide multiply matrix floating-point double big-endian
W.MUL.MAT.F.64.L      Wide multiply matrix floating-point double little-endian



Selection 



class 


op 


type 


prec 


order 


wide multiply matrix 


W.MULMAT 


F 


16 32 64 


LB 


C.F 


16 32 


LB 



Format 

W.op.prec.order rd=rc t rb 
rd=wopprecorder(rc>rb) 

31 24 23 1817 12 11 65 21 0 

I W.MINOR.ortfer I rd I rc I rb I W.op lIH 
6 6 6 6 4 2 

Pr-*-log(prec)-3 



FIG. 17 A 
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1730 



1023 m[rc] 



128*128/size) 
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1760 
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111 



rb(128) 



0 



128 rd(128) 0 

Wide multiply matrix complex floating-point half 
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Definition 

def mul($ize,v f i,wj) as 

mul^fmul(F(si20 f v s i 2 s.i^..i) I F(si2e,w S j 2e .H.j)) 
enddef 

def WideMulliplyMatrixFloatingPointtniajor.op.gsize.rd.rc.rb) 
c-*- RegRead(rc, 64) 
b-*-RegRead{rb, 128) 
Igsize-*- log(gsize) 
switch op of 

W.MULMAT.F.16, W.MULMAT.F32, W.MULMAT.F.64: 

raise AccessDisallowedByVirtualAddress 
endif 

wsize-«-(c and (0-c))|| 0 
l-*-c and (c-1) 

else 

wsize -«-128 
t-*-c 
endif 

lwsize-«-Iog(wsize) 

W ti ws ize^-lgsfce..lwsize-3 * 0 then 

msize^(tand (0-t))|| 0 4 
VirtAddr^tand (t-1) 

else 

msize -*-128*wsizeyg$ize 
VirtAddr-^t 

endif 

vs ize-*- m$ize*gsize/wsize 
W.MUI.MAT.C.M6, W.MULMAT.C.F.32, W.MULMAT.C.F.64: 

K Cig$>z*-4..0 * 0 then 

raise AccessDisaUowedByVirtualAddress 

endif 

i f c 3JgsiIB .3#0then 

w$ize-«-(c and (0-c))|| 0 4 
t-*-c and (c-1) 

else 

w$ize-*-128 
endif 

lwsize-*-log(wsize) 

if ttwsi28*5-lgsize..rwsize-3 * 0 then 

FIG. 17D-1 
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rasize^- (t and (0-t))|J 0 4 * 
VirtAddr^t and (1-1) 

else 

nisize 64*wsize/gsize 
VirtAddr-*-t 

endif 

vsize -*-2*msize*gsize/wsize 

endcase 
case major of 
M. MINOR. B: 

orders B 
M.MIN0R.L: 
order-*- L 

endcase 

rn LoadMemory (c.VirtAddr.msize ,order) 
for to wsize-gsize by gsize 
q{0].t^NULL 

for j-«-0 to vsize-gsize by gsize 
case op of 

W.MULMAT.F.16, W.MULMAT.F.32, W.MULMAT.F.64: 
qO+gsize) faddqOJ, mul(gsize,rTu+wsize* 

W.MUL.MAT.C.F.16. W.MULMAT.C.F.32, 

W.MULMAT.C.F.64: 

if H) & j & gsize = 0 then 

k -M-tf&gstzeJ-Hwsize-ja lgsjw , 1 
qti+gsize] *-faqq(j], muKgsize.m.k.bj)) 

else 

k-*- i^size-wsize*j 8 .j gs j 2e ^i 
qD+gsize] -«-fsubqlj]. rmiKgsize.m.k.b.j)) 
endif 

endcase 
endfor 

agB«ze-i*u-*- qlvsizej 
endfor 

ai27..wstee"*— 0 
RegWriie(rd f 128, a) 
enddef 



FIG. 17D-2 
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Exceptions 

Floating-point arithmetic 
Access disallowed by virtual address 
Access disallowed by tag 
Access disallowed by global TB 
Access disallowed by local TB 
Access detail required by tag 
Access detail required by local TB 
Access detail required by global TB 
Local TB miss 
Global TB miss 
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1810 



Operation codes 



W.MUL.MAT.G.8.B    Wide multiply matrix Galois bytes big-endian
W.MUL.MAT.G.8.L    Wide multiply matrix Galois bytes little-endian



Selection 



class 


op 


Size 


order 


Multiply matrix Galois 


W.MULMAT.G 


8 


B L 



Format 

W.op.order ra=rc,rd,rb 

ra=woporder{rc,rd.rb) 

31 24 23 18 17 12 11 6 5 0 

1 W.op.order I rd I rc I rb I ra 1 
8 6 6 6 6 



FIG. 18A 

Definition 1860
def c ← PolyMultiply(size,a,b) as
    p[0] ← 0^{2*size}
    for k ← 0 to size-1
        p[k+1] ← p[k] ^ a_{k} ? (0^{size-k} || b || 0^{k}) : 0^{2*size}
    endfor
    c ← p[size]
enddef

def c**-Po!yResidue(size,a,b) as 
p(0] — a 

for k-*-size-1 to0by-1 

p[k-1]^p{kl A P[0] si28 . k ?(0***|1 b|| 0* ) : 0 2 ' slz * 
endfor 

c^Plsize]si Za .v.O 
enddef 
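
Together, PolyMultiply and PolyResidue give a Galois-field multiply: a carry-less product followed by reduction modulo a polynomial whose low coefficients are supplied in b and whose leading coefficient is implicitly 1. The Python sketch below illustrates this under stated assumptions. It reduces by ordinary polynomial long division on the running remainder, which may differ in detail from the garbled pseudocode, and all function names are hypothetical.

```python
def poly_multiply(size: int, a: int, b: int) -> int:
    """Carry-less product of two size-bit polynomials over GF(2)."""
    p = 0
    for k in range(size):
        if (a >> k) & 1:
            p ^= b << k
    return p


def poly_residue(size: int, a: int, b: int) -> int:
    """Remainder of a (up to 2*size bits) modulo x^size + b(x)."""
    modulus = (1 << size) | (b & ((1 << size) - 1))
    p = a
    for k in range(size - 1, -1, -1):
        # Cancel the coefficient of x^(size+k) if it is set.
        if (p >> (size + k)) & 1:
            p ^= modulus << k
    return p & ((1 << size) - 1)


def galois_multiply(size: int, x: int, y: int, poly_low: int) -> int:
    """Multiply x and y in GF(2^size) defined by x^size + poly_low."""
    return poly_residue(size, poly_multiply(size, x, y), poly_low)
```

With size 8 and poly_low 0x1B (the polynomial x^8 + x^4 + x^3 + x + 1 used by AES), galois_multiply(8, 0x57, 0x83, 0x1B) returns 0xC1, matching the worked example in the AES specification.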

def WideMultiplyMatrixGalois(op,gsize,rd,rc,rb,ra)
    d ← RegRead(rd, 128)
    c ← RegRead(rc, 64)
    b ← RegRead(rb, 128)
    lgsize ← log(gsize)
    if c_{lgsize-4..0} ≠ 0 then
        raise AccessDisallowedByVirtualAddress
    endif

if c 3 .. f g Sl20 .3 * 0 then 

wsize -*-(c and (0-c))||0 4 
t-#-c and (c-1) 

else 

w$ize-*-128 

endif 

lwsize-*-tog(wsize) 

if t lwsize*6-lgaze. Jwetae-3 * 0 then 

msize-«-(t and (0-t)) || 0 4 

VirtAddr-*-tand(M) 

else 

msize -»-128Nvsize/gsize 
VirtAddf-^t 

endif 

case op of
    W.MUL.MAT.G.8.B:
        order ← B
    W.MUL.MAT.G.8.L:
        order ← L
endcase

FIG. 18C-1
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^1860 



m ← LoadMemory(c,VirtAddr,msize,order)
for i ← 0 to wsize-gsize by gsize
    for j ← 0 to vsize-gsize by gsize
        k ← i+wsize*j_{8..lgsize}
        q[j+gsize] ← q[j] ^ PolyMultiply(gsize, m_{k+gsize-1..k}, d_{j+gsize-1..j})
    endfor
    a_{gsize-1+i..i} ← PolyResidue(gsize, q[vsize], b_{gsize-1..0})
endfor
RegWrite(ra, 128, a)
enddef



FIG. 18C-2 
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Exceptions 

Access disallowed by virtual address 
Access disallowed by tag 
Access disallowed by global TB 
Access disallowed by local TB 
Access detail required by tag 
Access detail required by local TB 
Access detail required by global TB 
Local TB miss 
Global TB miss 
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Operation codes 



E.MULADD.X    Ensemble multiply add extract
E.CON.X       Ensemble convolve extract

Format

E.op rd@rc,rb,ra
rd=gop(rd,rc,rb,ra)

 31    24 23    18 17    12 11     6 5      0
|  E.op  |   rd   |   rc   |   rb   |   ra   |
    8        6        6        6        6



FIG. 19A 

Figures 19B and 20B have blank fields; they should be:

| fsize | dpos | x | s | n | m | l | rnd | gssp |



FIG. 19B 
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r 



1930 



127 



^ Vxtfact / 1 ' Vxtracj/ Extract/ ' 



rc(128) 



en 



\extract/ , , \pxUact/ 1 .N pxtracl/ , ,\ pxtrac|/ , r 



:* 



127 



rb(128) 



i 0 



\exUac^ / 



iiii rn 



128 rd(128) 0 

Ensemble multiply add extract doublets 



FIG. 19C 

[Diagram residue: operand rb(128) and extract units; result rd(128), bits 127..0.]

Ensemble complex multiply add extract doublets

The ensemble-multiply-add-extract instructions (E.MULADD.X), when
the x bit is set, multiply the low-order 64 bits of each of the rc and rb
registers and produce extended (double-size) results.



FIG. 19D 
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128 rd(128) 0 



Ensemble convolve extract doublets 



FIG. 19E 
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Ensemble convolve extract complex doublets 



FIG. 19F 
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Definition ^-1990 
def mul{size,h ,vs,v, i ,ws,w, j) as * 

enddef 



def EnsembleExtractInplace(op,ra,rb,rc,rd) as
    d ← RegRead(rd, 128)
    c ← RegRead(rc, 128)
    b ← RegRead(rb, 128)
    case b_{8..0} of
        0..255:
            sgsize ← 128
        256..383:
            sgsize ← 64
        384..447:
            sgsize ← 32
        448..479:
            sgsize ← 16
        480..495:
            sgsize ← 8
        496..503:
            sgsize ← 4
        504..507:
            sgsize ← 2
        508..511:
            sgsize ← 1
    endcase

n-^ai3 
signed -#-ai 4 

case op of 

E.CON.X: 

if (sgsize < 8} then 

gsize-*- 8 
elseif (sgsize # {n-1)*(x+1) > 128 then 
gsize^128/(n-1)/(x*1) 

else 

gstze-*- sgsize 

endif 

lgsize-«-log(gsize) 
wsize 128/(x+1) 



FIG. 19G-1 
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vsize -*-128 
ds-*-cs-*- signed 
bs-#~ signed A m 
zs signed or m or n 
zsize-*-gsize*(x+1) 
h-*-{2 # gsize) + log (vsize) - Igsize 
spos-*- (aa.o) and (2*gsize-1) 

E.MULADD.X: 

tf(sgsize < 9) then 

gsize-*-8 
elseif (sgsize*(n+inx+1) > 128) then 

gs>ze-*-128/(n*1)/(x+1) 

else 

gsize-*-sgsize 
endif 

ds^*- signed 
cs-*- signed A m 
zs-*- signed or m or n 
zsize**- gsize*(x+l) 
h-*-(2*g$ize) + n 
spos-*-(a B 0 ) and {2*gssize-t) 
endcase 

dpo$-*-(0|! a 2 & is) and (zsize-1) 
r-*-spos 

sfsize-«-(0|| a 31 24) and (zsize-1) 
tfsize-*- (sfsize = 0) or ((sfsize-nJpos) > zsize) ? zsize-dpos : sfslze 
fsize -*-(tfs»zB + spos > h) ? h - spos : tfsize 
if (bio 9 = Z) and not as then 
rnd-^F 

else 

rnd-*-bio..9 
endif 




FIG. 19G-2 
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^1990 

for k-*- 0 to wsize-zsize by zsize 
i^«-k*gsize/zsize 
case op of 

E.C0N.X: 

for j-*- 0 to vsize-gsize by gsize 
if n then 

gsize = 0 then 
q(j*gsize)-«-q(j] + mul(g$ize f h,ms t m l M- 

else 

qfl-K)$ize)-4-q|jl • muKgsize.h.ms.H- 
t28-j+2*gsize,bs,bj) 
endif 

else 

qQ+gsize] -*-qUl + mu!tgsize t h.ms,m,i+ 
128-j.bs.bj) 

endif 
endfor 

p-«~qlv$ize} 
E.MULAO0.X: 

di -*-((dS and dk+zize-l )h-zsize-r || (dk-zsize-t * )|| 0 r ) 
if n then 

if ( i end gsize) = 0 then 

p^multgsize.h.ds.d.i.cs.c.i)- 
fnuUgsize.h.ds.dj+gsize^cs.c.i^gsizeJ-KJi 
else 

p^mul(g$ize,h,d$,dj.csxi^size)4fnu^ 
endif 

else 

p-#- muitgsize^.ds^J.cs.ci) + di 

endif 

endcase 



FIG. m-3 
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1990 



case rnd of 

N: 

s-»-0 h 

C: 

s^o h "lli f 

endcase 

v^((z5&p M )}jp) + (0|| S ) 

cvtU^I^ & Vffsi 2 «.i)h*i-f-is«e) or not ( , and (o s 

caI KACT)) then 

else W ^ ( " & ^^" ^'^^l^toe-w.rll 0<«po« 
^ ^ w-«-(2S ? (VhlJ-v^ze-dpos-l) . iza2eH) P os)|| 0 dpo8 



endfor 

RegWrite(rd. 128, z) 
enddef 



F/6. 19G-4 
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^2010 


Operation codes

E.MUL.X         Ensemble multiply extract
E.EXTRACT       Ensemble extract
E.SCAL.ADD.X    Ensemble scale add extract

Format

E.op ra=rd,rc,rb
ra=eop(rd,rc,rb)

 31    24 23    18 17    12 11     6 5      0
|  E.op  |   rd   |   rc   |   rb   |   ra   |
    8        6        6        6        6



FIG. 20A 

Figures 19B and 20B have blank fields; they should be:

| fsize | dpos | x | s | n | m | l | rnd | gssp |



FIG. 20B 
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extra 

1 r 


icy/ 






r \extra 




yfXtf 
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r 


pxtrac 




spxtract/ 

X 
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X 


X 
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128 
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2030 



127 




mm r A 




rd(12B) 




, , \»xtr act /, piracy /, ^xtra57 , t 



\»xtracl/ N pxtract/ Vxt^ct/ \pxtracj/ 




127 



rc(128) 



czm 

128 ra(128) 0 

Ensemble complex multiply extract doublets 

The ensemble-multiply-extract instructions (E.MUL.X), when
the x bit is set, multiply the low-order 64 bits of each of the rc and rb
registers and produce extended (double-size) results.



FIG. 20D 
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2040 



127 



127 



\extract/ 



rd(128) 
rc(128) 



95 
80 

79 
64 



rb(128) 



j^pxtp7|\extrac/ , \exlracy ' , 
\e«racl/ Nextract/ Xexlract/ \extract/ 

128 ra(128) 0 

Ensemble scale add extract doublets 



FIG. 20E 
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2050 




pxUacl/ , , \extrac{/ , p \p»tract/ 



Npxliact/ 



\exilrac7 

T 



rb(128) 



f Vxtrac^ / 



I 



1 r 



\ px tract/ 



128 



ra(128) 



E__J 



Ensemble complex scale add extract doublets 

The ensemble-scale-add-extract instructions (E.SCAL.ADD.X), when the x bit
is set, multiply the low-order 64 bits of each of the rd and rc registers by the
rb register fields and produce extended (double-size) results.



FIG. 20F 
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2060 



fsize spoc 




m fsize ^ dpos 

Ensemble extract 



FIG. 20G 
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2070 



fsize spoc 
- — 



st 



rd 



rc 



ra 



\ 



fsize 



m dpos 



Ensemble merge extract 



FIG. 20H 
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2080 



fsize spos 
« » « — 



St 




rd 


rd 






gsize\ 


► 






\ 




\ 






1 5 


a 


0 


ra 



fsize dpos 



Ensemble expand extract 



F/G. 201 
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Definition 2090 
det mul($ize,Ms.v .i.ws.wj) as f 

def EnsembleExtract(op,ra,rb,rc,rd) as
    d ← RegRead(rd, 128)
    c ← RegRead(rc, 128)
    b ← RegRead(rb, 128)
    case b_{8..0} of
        0..255:
            sgsize ← 128
        256..383:
            sgsize ← 64
        384..447:
            sgsize ← 32
        448..479:
            sgsize ← 16
        480..495:
            sgsize ← 8
        496..503:
            sgsize ← 4
        504..507:
            sgsize ← 2
        508..511:
            sgsize ← 1
    endcase
N-bn 

signed-*- bu 

case op of 

^EXTRACT: 

gsize + sgsize # 2(2-(m or x)) 
zsize**- sg$lze 
h-^- gsize 
as signed 

spos^(t8..o) and (gsize-1) 



FIG. 20J-1 
        E.SCAL.ADD.X:
            if (sgsize < 8) then
                gsize ← 8
            elseif (sgsize*(n+1) > 32) then
                gsize ← 32/(n+1)
            else
                gsize ← sgsize
            endif
            ds ← cs ← signed
            bs ← signed ^ m
            as ← signed or m or n
            zsize ← gsize*(x+1)
            h ← (2*gsize) + 1 + n
            spos ← (b_{8..0}) and (2*gsize-1)
        E.MUL.X:
            if (sgsize < 8) then
                gsize ← 8
            elseif (sgsize*(n+1)*(x+1) > 128) then
                gsize ← 128/(n+1)/(x+1)
            else
                gsize ← sgsize
            endif
            ds ← signed
            cs ← signed ^ m
            as ← signed or m or n
            zsize ← gsize*(x+1)
            h ← (2*gsize) + n
            spos ← (b_{8..0}) and (2*gsize-1)
    endcase
    dpos ← (0 || b_{23..16}) and (zsize-1)
    r ← spos
    sfsize ← (0 || b_{31..24}) and (zsize-1)
    tfsize ← (sfsize = 0) or ((sfsize+dpos) > zsize) ? zsize-dpos : sfsize
    fsize ← (tfsize + spos > h) ? h - spos : tfsize
    if (b_{10..9} = Z) and not as then
        rnd ← F
    else
        rnd ← b_{10..9}
    endif



FIG. 20J-2 
    for j ← 0 to 128-zsize by zsize
        i ← j*gsize/zsize
        case op of
            E.EXTRACT:
                if m or x then
                    p ← d_{gsize+i-1..i}
                else
                    p ← (d || c)_{gsize+i-1..i}
                endif
            E.MUL.X:
                if n then
                    if (i and gsize) = 0 then
                        p ← mul(gsize,h,ds,d,i,cs,c,i) - mul(gsize,h,ds,d,i+gsize,cs,c,i+gsize)
                    else
                        p ← mul(gsize,h,ds,d,i,cs,c,i+gsize) + mul(gsize,h,ds,d,i,cs,c,i+gsize)
                    endif
                else
                    p ← mul(gsize,h,ds,d,i,cs,c,i)
                endif
            E.SCAL.ADD.X:
                if n then
                    if (i and gsize) = 0 then
                        p ← mul(gsize,h,ds,d,i,bs,b,64+2*gsize)
                            + mul(gsize,h,cs,c,i,bs,b,64)
                            - mul(gsize,h,ds,d,i+gsize,bs,b,64+3*gsize)
                            - mul(gsize,h,cs,c,i+gsize,bs,b,64+gsize)
                    else
                        p ← mul(gsize,h,ds,d,i,bs,b,64+3*gsize)
                            + mul(gsize,h,cs,c,i,bs,b,64+gsize)
                            + mul(gsize,h,ds,d,i+gsize,bs,b,64+2*gsize)
                            + mul(gsize,h,cs,c,i+gsize,bs,b,64)
                    endif
                else
                    p ← mul(gsize,h,ds,d,i,bs,b,64+gsize) + mul(gsize,h,cs,c,i,bs,b,64)
                endif
        endcase



FIG. 20J-3 
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caserndof 2090 
N: ~ 
S — 0 h -'!h Pf J|pr-1 

C: 

s— -oMl V 

endcase 

v— ((as&p h . 1 )||p)+(0lls) 

'f (Vh..ffsiza a (as & v r 4isi29.i) h * , - r * ,siM ) or not (I and (op = 
E.EXTRACT)) then 
W-*- (as & Vr^e.Oza^B-fsizeKJposjl^^^ f || 0 dpo S 

else 

w ( S ? (v^ll -yisize-dpos-l j . ^size-dposj y Qdpos 
endif 

        if m and (op = E.EXTRACT) then
            z_{zsize-1+j..j} ← c_{zsize-1+j..dpos+fsize+j} || w_{dpos+fsize-1..dpos} || c_{dpos-1+j..j}
        else
            z_{zsize-1+j..j} ← w
        endif
    endfor
    RegWrite(ra, 128, z)
enddef



FIG. 20J-4 
Typical dynamic-linked, inter-gateway calling sequence:

caller:
caller     A.ADDI      sp@-size       // allocate caller stack frame
           S.I.64.A    lp,sp,off
           S.I.64.A    dp,sp,off
           ...
           L.I.64.A    lp=dp,off      // load lp
           L.I.64.A    dp=dp,off      // load dp
           B.GATE
           L.I.64.A    dp,sp,off
           ...(code using dp)
           L.I.64.A    lp=sp,off      // restore original lp register
           A.ADDI      sp=size        // deallocate caller stack frame
           B           lp             // return

callee (non-leaf):
callee:    L.I.64.A    dp=dp,off      // load dp with data pointer
           S.I.64.A    sp,dp,off
           L.I.64.A    sp=dp,off      // new stack pointer
           S.I.64.A    lp,sp,off
           S.I.64.A    dp,sp,off
           ...(using dp)
           L.I.64.A    dp,sp,off
           ...(code using dp)
           L.I.64.A    lp=sp,off      // restore original lp register
           L.I.64.A    sp=sp,off      // restore original sp register
           B.DOWN      lp

callee (leaf, no stack):
callee:    ...(using dp)
           B.DOWN      lp



FIG. 21 B 



Case 2:05-cv-00505-TJW Document 129 Filed 09/12/2007 Page 39 of 40 
U.S. Patent Apr. 20, 2004 Sheet 76 of 148 US 6,725,356 B2 



2160 



Operation codes

B.GATE    Branch gateway

Equivalencies

B.GATE ← B.GATE 0

Format

B.GATE rb
bgate(rb)

31       24 23    18 17    12 11     6 5        0
| B.MINOR  |   0    |   1    |   rb   | B.GATE  |
     8         6        6        6         6

FIG. 21C 
[Figure: code, gate, and data regions annotated with access permissions (r, w, x, g), showing rb, rd, the pc and privilege-level pair, and the data pointer datap] 



Branch gateway 
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Definition 

def BranchGateway(rd,rc,rb) as
    c ← RegRead(rc, 64)
    b ← RegRead(rb, 64)
    if (rd ≠ 0) or (rc ≠ 1) then
        raise ReservedInstruction
    endif
    if c_{2..0} ≠ 0 then
        raise AccessDisallowedByVirtualAddress
    endif
    d ← ProgramCounter_{63..2}+1 || PrivilegeLevel
    if PrivilegeLevel < b_{1..0} then
        m ← LoadMemoryG(c,c,64,L)
        if b ≠ m then
            raise GatewayDisallowed
        endif
        PrivilegeLevel ← b_{1..0}
    endif
    ProgramCounter ← b_{63..2} || 0^{2}
    RegWrite(rd, 64, d)
    raise TakenBranch
enddef
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Gateway disallowed 
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Access disallowed by global TB 
Access disallowed by local TB 
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Local TB miss 
Global TB miss 
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Operation codes 



E.SCAL.ADD.F.16    Ensemble scale add floating-point half
E.SCAL.ADD.F.32    Ensemble scale add floating-point single
E.SCAL.ADD.F.64    Ensemble scale add floating-point double

Selection

class        op              prec
scale add    E.SCAL.ADD.F    16 32 64

Format

E.op.prec ra=rd,rc,rb
ra=eopprec(rd,rc,rb)

31        24 23    18 17    12 11     6 5      0
| E.op.prec |   rd   |   rc   |   rb   |   ra   |
      8         6        6        6        6
def EnsembleFloatingPointTernary(op,prec,rd,rc,rb,ra) as
    d ← RegRead(rd, 128)
    c ← RegRead(rc, 128)
    b ← RegRead(rb, 128)
    for i ← 0 to 128-prec by prec
        di ← F(prec,d_{i+prec-1..i})
        ci ← F(prec,c_{i+prec-1..i})
        ai ← fadd(fmul(di, F(prec,b_{prec-1..0})), fmul(ci, F(prec,b_{2*prec-1..prec})))
        a_{i+prec-1..i} ← PackF(prec, ai, none)
    endfor
    RegWrite(ra, 128, a)
enddef 
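
The loop above computes, for each element, d_i times the first scale factor in rb plus c_i times the second. A rough numeric sketch in Python follows; it uses ordinary Python floats in lists rather than packed 128-bit registers, so rounding to the instruction's precision is not modeled, and the name `ensemble_scale_add` is an assumption made for illustration.

```python
def ensemble_scale_add(d, c, b0, b1):
    """Element-wise d[i]*b0 + c[i]*b1, mirroring fadd(fmul(di, b0), fmul(ci, b1))."""
    if len(d) != len(c):
        raise ValueError("d and c must hold the same number of elements")
    return [di * b0 + ci * b1 for di, ci in zip(d, c)]
```

For example, `ensemble_scale_add([1.0, 2.0], [3.0, 4.0], 2.0, 0.5)` scales the first vector by 2.0, the second by 0.5, and adds them element by element.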
Operation codes

G.BOOLEAN    Group boolean

Selection

operation    function (binary)    function (decimal)
d            11110000             240
c            11001100             204
b            10101010             170
d&c&b        10000000             128
(d&c)|b      11101010             234
d|c|b        11111110             254
d?c:b        11001010             202
d^c^b        10010110             150
~d^c^b       01101001             105
0            00000000             0

Format

G.BOOLEAN rd@trc,trb,f
rd=gbooleani(rd,rc,rb,f)

31       25 24 23    18 17    12 11     6 5      0
| G.BOOLEAN |ih|   rd   |   rc   |   rb   |   il   |



FIG. 23A 
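
Each function code in the selection table is the truth table of a three-input boolean function, with the bit at index (d, c, b) holding the function's output. The following Python sketch rebuilds the codes from that reading; `function_code` is a hypothetical helper name, not part of the instruction set.

```python
def function_code(fn) -> int:
    """Build the 8-bit code whose bit (d<<2 | c<<1 | b) equals fn(d, c, b)."""
    code = 0
    for index in range(8):
        d, c, b = (index >> 2) & 1, (index >> 1) & 1, index & 1
        code |= (fn(d, c, b) & 1) << index
    return code
```

Under this reading the operands themselves have codes 240 (d), 204 (c), and 170 (b), and any other code is obtained by applying the same boolean expression to those three constants.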






2320 



if f6=f5 then
    if f2=f1 then
        if f2 then
            rc ← max(trc,trb)
            rb ← min(trc,trb)
        else
            rc ← min(trc,trb)
            rb ← max(trc,trb)
        endif
        ih ← 0
        il ← 0 || f6 || f7 || f4 || f3 || f0
    else
        if f2 then
            rc ← trb
            rb ← trc
        else
            rc ← trc
            rb ← trb
        endif
        ih ← 0
        il ← 1 || f6 || f7 || f4 || f3 || f0
    endif
else
    ih ← 1
    if f6 then
        rc ← trb
        rb ← trc
        il ← f1 || f2 || f7 || f4 || f3 || f0
    else
        rc ← trc
        rb ← trb
        il ← f2 || f1 || f7 || f4 || f3 || f0
    endif
endif



FIG. 23B 
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Definition 

def GroupBoolean(ih,rd,rc,rb,il)
    d ← RegRead(rd, 128)
    c ← RegRead(rc, 128)
    b ← RegRead(rb, 128)
    if ih=0 then
        if il5=0 then
            f ← il3 || il4 || il4 || il2 || il1 || (rc>rb)^{2} || il0
        else
            f ← il3 || il4 || il4 || il2 || il1 || 0 || 1 || il0
        endif
    else
        f ← il3 || 0 || 1 || il2 || il1 || il5 || il4 || il0
    endif
    for i ← 0 to 127 by size
        a_{i} ← f_{(d_{i} || c_{i} || b_{i})}
    endfor
    RegWrite(rd, 128, a)
enddef



FIG. 23C 
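
The assembler-side encoding (FIG. 23B) and the hardware-side decoding (FIG. 23C) fit an 8-bit function code into the 7 bits ih and il by swapping the rc and rb operands where needed and, for functions symmetric in c and b, by using the order of the register numbers as one extra bit. The Python sketch below is one reading of the two figures, written to check that they are inverses; the names `encode`, `decode`, and `_pack` are assumptions introduced here.

```python
def _pack(*bits: int) -> int:
    """Concatenate bits, most significant first."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

def encode(f: int, trc: int, trb: int):
    """Return (ih, il, rc, rb) for function code f and register numbers trc, trb."""
    f0, f1, f2, f3, f4, f5, f6, f7 = ((f >> n) & 1 for n in range(8))
    if f6 == f5:
        if f2 == f1:
            # Symmetric in c and b: register order carries the value of f2.
            hi, lo = max(trc, trb), min(trc, trb)
            rc, rb = (hi, lo) if f2 else (lo, hi)
            il = _pack(0, f6, f7, f4, f3, f0)
        else:
            rc, rb = (trb, trc) if f2 else (trc, trb)
            il = _pack(1, f6, f7, f4, f3, f0)
        return 0, il, rc, rb
    if f6:
        return 1, _pack(f1, f2, f7, f4, f3, f0), trb, trc
    return 1, _pack(f2, f1, f7, f4, f3, f0), trc, trb

def decode(ih: int, il: int, rc: int, rb: int) -> int:
    """Recover the function code applied to (d, c, b) read from rd, rc, rb."""
    il0, il1, il2, il3, il4, il5 = ((il >> n) & 1 for n in range(6))
    if ih == 0:
        if il5 == 0:
            g = int(rc > rb)
            return _pack(il3, il4, il4, il2, il1, g, g, il0)
        return _pack(il3, il4, il4, il2, il1, 0, 1, il0)
    return _pack(il3, 0, 1, il2, il1, il5, il4, il0)
```

The decoded code can differ from the original f when the operands were swapped, or when trc and trb name the same register; what is preserved is the result of applying the function to the register contents.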
Operation codes

B.HINT    Branch Hint

Format

B.HINT badd,count,rd
bhint(badd,count,rd)

31       24 23    18 17    12 11     6 5        0
| B.MINOR  |   rd   | count  |  simm  | B.HINT  |
     8         6        6        6         6

simm ← badd-pc-4



FIG. 24A 
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Definition 

def BranchHint(rd,count,simm) as
    d ← RegRead(rd, 64)
    if (d_{1..0}) ≠ 0 then
        raise AccessDisallowedByVirtualAddress
    endif
    FetchHint(ProgramCounter + 4 + (0 || simm || 0^{2}), d_{63..2} || 0^{2}, count)
enddef



FIG. 24B 
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Exceptions 

Access disallowed by virtual address 



FIG. 24C 
Operation codes

E.SINK.F.16         Ensemble convert floating-point doublets from half nearest default
E.SINK.F.16.C       Ensemble convert floating-point doublets from half ceiling
E.SINK.F.16.C.D     Ensemble convert floating-point doublets from half ceiling default
E.SINK.F.16.F       Ensemble convert floating-point doublets from half floor
E.SINK.F.16.F.D     Ensemble convert floating-point doublets from half floor default
E.SINK.F.16.N       Ensemble convert floating-point doublets from half nearest
E.SINK.F.16.X       Ensemble convert floating-point doublets from half exact
E.SINK.F.16.Z       Ensemble convert floating-point doublets from half zero
E.SINK.F.16.Z.D     Ensemble convert floating-point doublets from half zero default
E.SINK.F.32         Ensemble convert floating-point quadlets from single nearest default
E.SINK.F.32.C       Ensemble convert floating-point quadlets from single ceiling
E.SINK.F.32.C.D     Ensemble convert floating-point quadlets from single ceiling default
E.SINK.F.32.F       Ensemble convert floating-point quadlets from single floor
E.SINK.F.32.F.D     Ensemble convert floating-point quadlets from single floor default
E.SINK.F.32.N       Ensemble convert floating-point quadlets from single nearest
E.SINK.F.32.X       Ensemble convert floating-point quadlets from single exact
E.SINK.F.32.Z       Ensemble convert floating-point quadlets from single zero
E.SINK.F.32.Z.D     Ensemble convert floating-point quadlets from single zero default
E.SINK.F.64         Ensemble convert floating-point octlets from double nearest default
E.SINK.F.64.C       Ensemble convert floating-point octlets from double ceiling
E.SINK.F.64.C.D     Ensemble convert floating-point octlets from double ceiling default
E.SINK.F.64.F       Ensemble convert floating-point octlets from double floor
E.SINK.F.64.F.D     Ensemble convert floating-point octlets from double floor default
E.SINK.F.64.N       Ensemble convert floating-point octlets from double nearest
E.SINK.F.64.X       Ensemble convert floating-point octlets from double exact
E.SINK.F.64.Z       Ensemble convert floating-point octlets from double zero
E.SINK.F.64.Z.D     Ensemble convert floating-point octlets from double zero default
E.SINK.F.128        Ensemble convert floating-point hexlet from quad nearest default
E.SINK.F.128.C      Ensemble convert floating-point hexlet from quad ceiling
E.SINK.F.128.C.D    Ensemble convert floating-point hexlet from quad ceiling default
E.SINK.F.128.F      Ensemble convert floating-point hexlet from quad floor
E.SINK.F.128.F.D    Ensemble convert floating-point hexlet from quad floor default
E.SINK.F.128.N      Ensemble convert floating-point hexlet from quad nearest
E.SINK.F.128.X      Ensemble convert floating-point hexlet from quad exact
E.SINK.F.128.Z      Ensemble convert floating-point hexlet from quad zero
E.SINK.F.128.Z.D    Ensemble convert floating-point hexlet from quad zero default



FIG. 25A-1 
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Selection 





                      op      prec            round/trap
integer from float    SINK    16 32 64 128    NONE C F N X Z C.D F.D Z.D

Format

E.SINK.F.prec.rnd rd=rc
rd=esinkfprecrnd(rc)

[Figure: 32-bit instruction field layout; legible labels are E.prec (bits 31..24, 8 bits), rd (bits 23..18), and rc]
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Definition 

def EnsembleSinkFloatingPoint(prec,round,rd,rc) as
    c ← RegRead(rc, 128)
    for i ← 0 to 128-prec by prec
        ci ← F(prec,c_{i+prec-1..i})
        a_{i+prec-1..i} ← fsinkr(prec, ci, round)
    endfor
    RegWrite(rd, 128, a)
enddef



FIG. 25B 
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Exceptions 

Floating-point arithmetic 



FIG. 25C 
def eb ← ebits(prec) as
    case prec of
        16:
            eb ← 5
        32:
            eb ← 8
        64:
            eb ← 11
        128:
            eb ← 15
    endcase
enddef

def eb ← ebias(prec) as
    eb ← 0 || 1^{ebits(prec)-1}
enddef

def fb ← fbits(prec) as
    fb ← prec - 1 - eb
enddef

def a ← F(prec, ai) as
    a.s ← ai_{prec-1}
    ae ← ai_{prec-2..fbits(prec)}
    af ← ai_{fbits(prec)-1..0}
    if ae = 1^{ebits(prec)} then
        if af = 0 then
            a.t ← INFINITY
        elseif af_{fbits(prec)-1} then
            a.t ← SNaN
            a.e ← -fbits(prec)
            a.f ← 1 || af_{fbits(prec)-2..0}
        else
            a.t ← QNaN
            a.e ← -fbits(prec)
            a.f ← af
        endif
    elseif ae = 0 then
        if af = 0 then
            a.t ← ZERO



FIG. 25D-1 
        else
            a.t ← NORM
            a.e ← 1-ebias(prec)-fbits(prec)
            a.f ← 0 || af
        endif
    else
        a.t ← NORM
        a.e ← ae-ebias(prec)-fbits(prec)
        a.f ← 1 || af
    endif
enddef

def a ← DEFAULTQNAN as
    a.s ← 0
    a.t ← QNAN
    a.e ← -1
    a.f ← 1
enddef

def a ← DEFAULTSNAN as
    a.s ← 0
    a.t ← SNAN
    a.e ← -1
    a.f ← 1
enddef



FIG. 25D-2 
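
The F(prec, ai) definition unpacks a packed floating-point bit pattern into a sign a.s, a type a.t, an exponent a.e, and an integer fraction a.f, so that a NORM value equals (-1)^s * f * 2^e. The Python sketch below follows that definition as read from the figures; the `Unpacked` tuple and the function name `unpack` are illustrative assumptions, and the NaN classification (top fraction bit set gives SNAN) simply mirrors the figure.

```python
from typing import NamedTuple

EBITS = {16: 5, 32: 8, 64: 11, 128: 15}

class Unpacked(NamedTuple):
    s: int   # sign bit
    t: str   # NORM, ZERO, INFINITY, SNAN, or QNAN
    e: int   # exponent applied to the integer fraction
    f: int   # integer fraction

def ebits(prec: int) -> int:
    return EBITS[prec]

def ebias(prec: int) -> int:
    return (1 << (ebits(prec) - 1)) - 1      # 0 || 1^(ebits-1)

def fbits(prec: int) -> int:
    return prec - 1 - ebits(prec)

def unpack(prec: int, ai: int) -> Unpacked:
    eb, fb = ebits(prec), fbits(prec)
    s = (ai >> (prec - 1)) & 1
    ae = (ai >> fb) & ((1 << eb) - 1)
    af = ai & ((1 << fb) - 1)
    if ae == (1 << eb) - 1:                   # all-ones exponent
        if af == 0:
            return Unpacked(s, "INFINITY", 0, 0)
        if (af >> (fb - 1)) & 1:
            low = af & ((1 << (fb - 1)) - 1)
            return Unpacked(s, "SNAN", -fb, (1 << (fb - 1)) | low)
        return Unpacked(s, "QNAN", -fb, af)
    if ae == 0:                               # zero or denormal
        if af == 0:
            return Unpacked(s, "ZERO", 0, 0)
        return Unpacked(s, "NORM", 1 - ebias(prec) - fb, af)
    return Unpacked(s, "NORM", ae - ebias(prec) - fb, (1 << fb) | af)
```

For instance, the single-precision pattern 0x3F800000 unpacks to f = 2^23 and e = -23, whose product is 1.0.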
def fadd(a,b) as faddr(a,b,N) enddef

def c ← faddr(a,b,round) as
    if a.t=NORM and b.t=NORM then
        // d,e are a,b with exponent aligned and fraction adjusted
        if a.e > b.e then
            d ← a
            e.t ← b.t
            e.s ← b.s
            e.e ← a.e
            e.f ← b.f || 0^{a.e-b.e}
        else if a.e < b.e then
            d.t ← a.t
            d.s ← a.s
            d.e ← b.e
            d.f ← a.f || 0^{b.e-a.e}
            e ← b
        endif
        c.t ← d.t
        c.e ← d.e
        if d.s = e.s then
            c.s ← d.s
            c.f ← d.f + e.f
        elseif d.f > e.f then
            c.s ← d.s
            c.f ← d.f - e.f
        elseif d.f < e.f then
            c.s ← e.s
            c.f ← e.f - d.f
        else
            c.s ← (round=F)
            c.t ← ZERO
        endif



FIG. 25D-3 
    // priority is given to b operand for NaN propagation
    elseif (b.t=SNAN) or (b.t=QNAN) then
        c ← b
    elseif (a.t=SNAN) or (a.t=QNAN) then
        c ← a
    elseif a.t=ZERO and b.t=ZERO then
        c.t ← ZERO
        c.s ← (a.s and b.s) or (round=F and (a.s or b.s))
    // NULL values are like zero, but do not combine with ZERO to alter sign
    elseif a.t=ZERO or a.t=NULL then
        c ← b
    elseif b.t=ZERO or b.t=NULL then
        c ← a
    elseif a.t=INFINITY and b.t=INFINITY then
        if a.s ≠ b.s then
            c ← DEFAULTSNAN // Invalid
        else
            c ← a
        endif
    elseif a.t=INFINITY then
        c ← a
    elseif b.t=INFINITY then
        c ← b
    else
        assert FALSE // should have covered all the cases above
    endif
enddef

def b ← fneg(a) as
    b.s ← ~a.s
    b.t ← a.t
    b.e ← a.e
    b.f ← a.f
enddef

def fsub(a,b) as fsubr(a,b,N) enddef

def fsubr(a,b,round) as faddr(a,fneg(b),round) enddef

def frsub(a,b) as frsubr(a,b,N) enddef

def frsubr(a,b,round) as faddr(fneg(a),b,round) enddef



FIG. 25D-4 
def c ← fcom(a,b) as
    if (a.t=SNAN) or (a.t=QNAN) or (b.t=SNAN) or (b.t=QNAN) then
        c ← U
    elseif a.t=INFINITY and b.t=INFINITY then
        if a.s ≠ b.s then
            c ← (a.s=0) ? G : L
        else
            c ← E
        endif
    elseif a.t=INFINITY then
        c ← (a.s=0) ? G : L
    elseif b.t=INFINITY then
        c ← (b.s=0) ? L : G
    elseif a.t=NORM and b.t=NORM then
        if a.s ≠ b.s then
            c ← (a.s=0) ? G : L
        else
            if a.e > b.e then
                af ← a.f
                bf ← b.f || 0^{a.e-b.e}
            else
                af ← a.f || 0^{b.e-a.e}
                bf ← b.f
            endif
            if af = bf then
                c ← E
            else
                c ← ((a.s=0) ^ (af > bf)) ? G : L
            endif
        endif
    elseif a.t=NORM then
        c ← (a.s=0) ? G : L
    elseif b.t=NORM then
        c ← (b.s=0) ? G : L
    elseif a.t=ZERO and b.t=ZERO then
        c ← E
    else
        assert FALSE // should have covered all the cases above
    endif
enddef

FIG. 25D-5 
def c ← fmul(a,b) as
    if a.t=NORM and b.t=NORM then
        c.s ← a.s ^ b.s
        c.t ← NORM
        c.e ← a.e + b.e
        c.f ← a.f * b.f
    // priority is given to b operand for NaN propagation
    elseif (b.t=SNAN) or (b.t=QNAN) then
        c.s ← a.s ^ b.s
        c.t ← b.t
        c.e ← b.e
        c.f ← b.f
    elseif (a.t=SNAN) or (a.t=QNAN) then
        c.s ← a.s ^ b.s
        c.t ← a.t
        c.e ← a.e
        c.f ← a.f
    elseif a.t=ZERO and b.t=INFINITY then
        c ← DEFAULTSNAN // Invalid
    elseif a.t=INFINITY and b.t=ZERO then
        c ← DEFAULTSNAN // Invalid
    elseif a.t=ZERO or b.t=ZERO then
        c.s ← a.s ^ b.s
        c.t ← ZERO
    else
        assert FALSE // should have covered all the cases above
    endif
enddef



FIG. 25D-6 
def c ← fdiv(a,b) as
    if a.t=NORM and b.t=NORM then
        c.s ← a.s ^ b.s
        c.t ← NORM
        c.e ← a.e - b.e + 256
        c.f ← (a.f || 0^{256}) / b.f
    // priority is given to b operand for NaN propagation
    elseif (b.t=SNAN) or (b.t=QNAN) then
        c.s ← a.s ^ b.s
        c.t ← b.t
        c.e ← b.e
        c.f ← b.f
    elseif (a.t=SNAN) or (a.t=QNAN) then
        c.s ← a.s ^ b.s
        c.t ← a.t
        c.e ← a.e
        c.f ← a.f
    elseif a.t=ZERO and b.t=INFINITY then
        c ← DEFAULTSNAN // Invalid
    elseif a.t=INFINITY and b.t=INFINITY then
        c ← DEFAULTSNAN // Invalid
    elseif a.t=ZERO then
        c.s ← a.s ^ b.s
        c.t ← ZERO
    elseif a.t=INFINITY then
        c.s ← a.s ^ b.s
        c.t ← INFINITY
    else
        assert FALSE // should have covered all the cases above
    endif
enddef

def msb ← findmsb(a) as
    MAXF ← 2^{18} // Largest possible f value after matrix multiply
    for j ← 0 to MAXF
        if a_{MAXF-1..j} = (0^{MAXF-1-j} || 1) then
            msb ← j
        endif
    endfor
enddef 



FIG. 25D-7 
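
The findmsb loop selects the bit position j for which bit j is set and every higher bit is clear, that is, the index of the most significant one bit. For an arbitrary-precision integer the same result can be sketched in Python with `int.bit_length`; rejecting non-positive inputs is an assumption added here, since the figure does not define the result for zero.

```python
def findmsb(a: int) -> int:
    """Index of the most significant set bit of a positive integer."""
    if a <= 0:
        raise ValueError("findmsb is only defined here for positive integers")
    return a.bit_length() - 1
```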
def ai ← PackF(prec,a,round) as
case a.t of 

NORM: 

msb-*-findm$b(a.f) 

m msb-l-fbits(prec) //Isb for normal 

rdn— -ebias(prec)-a.e-1-fb)ts{prec) // 1sb if a denormal 

rb-»-(m>rdn)?rn:rdn 

if rb < 0 then 

aifr-*- a-fmsb-L oHO-* 
eadj-*-0 

else 

case round of 

C: 

s-»-O ms,Mb ||{-a.s)*> 

s^O msb -' b ||(a.s)rt> 
N, NONE: 

s ^. ff mb-rt>||, a .f rb j, a> }[M 

if a.f,b-i..o * 0 then 

raise FloatingPoinlArithmefic // Inexact 
endif 
s-^-0 

2: 

s-»-0 

endcase 

v^(0||a.f TOb .. () ) + (0||s) 
if vmsb=1 then 

aifr^-Vmsn..,,, 

eadj-*-0 

else 

aifr-»- tfbttstprec) 

eadj«*-1 
endif 
endif 

aien-4- a.e + msb - 1 ♦ eadj + ebias(prec) 
ifaien<Othen 

if round = NONE then 

ai-^a-sllO^^^Haifr 

else 

raise FloatingPointArithmetic //Underflow 



FIG. 25D-8 
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end if 

elseif aien > leMs(prec) then 

if round = NONE then 

//default: round-to-nearest overflow handling 
ai a.s|| t8Wls(prec) j| Qlbi»s(prec) 

else 

raise FloatingPointArithmetic // Overflow 
endif 

else 

ai a.s 1 1 aien e bii8(pf9c)-i..o 1 1 aifr 
endif 

SNAN: 

if round * NONE then 

raise FloatingPointArithmetic //Invalid 
endif 

if -a.e < fbits(prec) then 

ai ^-a-slll"" 8 ^*)!! a.f-a.«-i..oll o fbite <P r « ,c >* a - e 

else 

1st a.t-a.e-1-(bits(prBc)*l .0 *0 
ai -^a.s|l1«wte(pr6c)|| a .t a . e . 1 ... a . e . 1 . tt ,i te( p r#ch2 ||1sb 
endif 
QNAN: 

if -a.e < fbits(prec) then 

ai-»- a.s|| iBbit8(ptec)j| a f^^, 0 |j 0 fbiu(pr«eh«.» 

else 

1sb a.t-a.e-i-n»i$(pfBc)»i..o * 0 

ai — a.s||1^ft»«c)|| a .f.aL».i„.a.«-ifb>t5(precwlI1sb 

endif 

ZERO: 

ai-*- a s U o obl « 8 tt»~)|| o^is^ec) 

INFINITY: 

ai -•- a.S 1 1 1«bits<prec) 1 1 0 Wt«(p»ec) 

endcase 
enddef
def ai ← fsinkr(prec, a, round) as
case a.t of 

NORM: 

msb-«-findmsb(a.f) 
rb —- -a.e 
if rb i 0 then 

aifr-*-a.f msb .. 0 |[Q- fb 

aims msb - rb 

else 

case round of 
C.CO: 

s ^_ 0 msb-rb|| { , ais) r6 

F.F.O: 

s-*-0 msb - rb ||(a».s)rt> 
N, NONE: 

s^O^-^H-ai.frbllai.f^ 1 

X: 

if ai.frb-i..o*Othen 

raise FloatingPointArilhmetic // Inexact 
endif 

Z, ZD: 

s -»-Q 

endcase 

v^(0!la.W. 0 )-(0||s) 

if Vmsb=1 then 

aims msb ♦ 1 - rb 

else 

aims -4- msb - rb 
endif 

anV-*- Vaiing..rt> 
endif 

if aims > prec then 
case round of 

CD, F.D, NONE, Z.D: 
ai-^a.slK-asJPfw-i 

C.FXXZ 

raise FloatingPointArithmetic // Overflow 

endcase 
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W70 

elseif as = 0 then 
ai -»-aifr 

else 

ai-»-aifr 

endif 

ZERO: 

ai\*_ Qprac 

SNAN, QNAN: 
case round of 

CD, F.D, NONE. Z.D: 

ai-*- OP" 90 
C. F. N. X, 2: 

raise Floatingpoint Arithmetic // Invalid 

endcase 
INFINITY: 

case round of 

CD, F.D, NONE, Z.D: 

C. F, N, X. Z: 

raise FloatingPointArithmetic // Invalid 

endcase 

endcase 
enddef 



def c ← frecrest(a) as
    b.s ← 0
    b.t ← NORM
    b.e ← 0
    b.f ← 1
    c ← fest(fdiv(b,a))
enddef

def c ← frsqrest(a) as
    b.s ← 0
    b.t ← NORM
    b.e ← 0
    b.f ← 1
    c ← fest(fsqr(fdiv(b,a)))
enddef
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— 2570 

def c ← fest(a) as
    if (a.t=NORM) then
        msb ← findmsb(a.f)
        a.e ← a.e + msb - 13
        a.f ← a.f_{msb..msb-12} || 1
    else
        c ← a
    endif
enddef

def c ← fsqr(a) as
    if (a.t=NORM) and (a.s=0) then
        c.s ← 0
        c.t ← NORM
        if (a.e_{0} = 1) then
            c.e ← (a.e-127) / 2
            c.f ← sqr(a.f || 0^{127})
        else
            c.e ← (a.e-128) / 2
            c.f ← sqr(a.f || 0^{128})
        endif
    elseif (a.t=SNAN) or (a.t=QNAN) or a.t=ZERO or ((a.t=INFINITY) and (a.s=0)) then
        c ← a
    elseif ((a.t=NORM) or (a.t=INFINITY)) and (a.s=1) then
        c ← DEFAULTSNAN // Invalid
    else
        assert FALSE // should have covered all the cases above
    endif
enddef



FIG. 25D-12 






Operation codes 



G.ADD.8          Group add bytes
G.ADD.16         Group add doublets
G.ADD.32         Group add quadlets
G.ADD.64         Group add octlets
G.ADD.128        Group add hexlet
G.ADD.L.8        Group add limit signed bytes
G.ADD.L.16       Group add limit signed doublets
G.ADD.L.32       Group add limit signed quadlets
G.ADD.L.64       Group add limit signed octlets
G.ADD.L.128      Group add limit signed hexlet
G.ADD.L.U.8      Group add limit unsigned bytes
G.ADD.L.U.16     Group add limit unsigned doublets
G.ADD.L.U.32     Group add limit unsigned quadlets
G.ADD.L.U.64     Group add limit unsigned octlets
G.ADD.L.U.128    Group add limit unsigned hexlet
G.ADD.8.O        Group add signed bytes check overflow
G.ADD.16.O       Group add signed doublets check overflow
G.ADD.32.O       Group add signed quadlets check overflow
G.ADD.64.O       Group add signed octlets check overflow
G.ADD.128.O      Group add signed hexlet check overflow
G.ADD.U.8.O      Group add unsigned bytes check overflow
G.ADD.U.16.O     Group add unsigned doublets check overflow
G.ADD.U.32.O     Group add unsigned quadlets check overflow
G.ADD.U.64.O     Group add unsigned octlets check overflow
G.ADD.U.128.O    Group add unsigned hexlet check overflow
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Format 

G.op.size rd=rc,rb
rd=gopsize(rc,rb)

31      24 23    18 17    12 11     6 5      0
| G.size  |   rd   |   rc   |   rb   |   op   |
     8        6        6        6        6



FIG. 26B 
def Group(op,size,rd,rc,rb)
    c ← RegRead(rc, 128)
    b ← RegRead(rb, 128)
    case op of
        G.ADD:
            for i ← 0 to 128-size by size
                a_{i+size-1..i} ← c_{i+size-1..i} + b_{i+size-1..i}
            endfor
        G.ADD.L:
            for i ← 0 to 128-size by size
                t ← (c_{i+size-1} || c_{i+size-1..i}) + (b_{i+size-1} || b_{i+size-1..i})
                a_{i+size-1..i} ← (t_{size} ≠ t_{size-1}) ? (t_{size} || ~t_{size}^{size-1}) : t_{size-1..0}
            endfor
        G.ADD.L.U:
            for i ← 0 to 128-size by size
                t ← (0^{1} || c_{i+size-1..i}) + (0^{1} || b_{i+size-1..i})
                a_{i+size-1..i} ← (t_{size} ≠ 0) ? (1^{size}) : t_{size-1..0}
            endfor
        G.ADD.O:
            for i ← 0 to 128-size by size
                t ← (c_{i+size-1} || c_{i+size-1..i}) + (b_{i+size-1} || b_{i+size-1..i})
                if t_{size} ≠ t_{size-1} then
                    raise FixedPointArithmetic
                endif
                a_{i+size-1..i} ← t_{size-1..0}
            endfor
        G.ADD.U.O:
            for i ← 0 to 128-size by size
                t ← (0^{1} || c_{i+size-1..i}) + (0^{1} || b_{i+size-1..i})
                if t_{size} ≠ 0 then
                    raise FixedPointArithmetic
                endif
                a_{i+size-1..i} ← t_{size-1..0}
            endfor
    endcase
    RegWrite(rd, 128, a)
enddef 
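
The G.ADD.L and G.ADD.L.U cases add each lane with one extra bit of headroom and clamp the result when that extra bit shows an overflow. The Python sketch below models the same behavior on integers that stand in for the 128-bit registers; the function name `group_add_limit` and its `signed` and `width` parameters are assumptions for illustration, and clamping with min and max replaces the bit-level test in the definition.

```python
def group_add_limit(c: int, b: int, size: int, signed: bool = True, width: int = 128) -> int:
    """Lane-wise saturating add of two width-bit values split into size-bit lanes."""
    mask = (1 << size) - 1
    result = 0
    for i in range(0, width, size):
        ci = (c >> i) & mask
        bi = (b >> i) & mask
        if signed:
            # Reinterpret each lane as a two's-complement value.
            if ci >> (size - 1):
                ci -= 1 << size
            if bi >> (size - 1):
                bi -= 1 << size
            lo, hi = -(1 << (size - 1)), (1 << (size - 1)) - 1
            total = max(lo, min(hi, ci + bi))
        else:
            total = min(mask, ci + bi)
        result |= (total & mask) << i
    return result
```

With byte lanes, adding 0x7F and 0x01 as signed values yields 0x7F rather than wrapping to 0x80, while the neighboring lanes are unaffected.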
Operation codes

G.SET.AND.E.8       Group set and equal zero bytes
G.SET.AND.E.16      Group set and equal zero doublets
G.SET.AND.E.32      Group set and equal zero quadlets
G.SET.AND.E.64      Group set and equal zero octlets
G.SET.AND.E.128     Group set and equal zero hexlet
G.SET.AND.NE.8      Group set and not equal zero bytes
G.SET.AND.NE.16     Group set and not equal zero doublets
G.SET.AND.NE.32     Group set and not equal zero quadlets
G.SET.AND.NE.64     Group set and not equal zero octlets
G.SET.AND.NE.128    Group set and not equal zero hexlet
G.SET.E.8           Group set equal bytes
G.SET.E.16          Group set equal doublets
G.SET.E.32          Group set equal quadlets
G.SET.E.64          Group set equal octlets
G.SET.E.128         Group set equal hexlet
G.SET.GE.8          Group set greater equal signed bytes
G.SET.GE.16         Group set greater equal signed doublets
G.SET.GE.32         Group set greater equal signed quadlets
G.SET.GE.64         Group set greater equal signed octlets
G.SET.GE.128        Group set greater equal signed hexlet
G.SET.GE.U.8        Group set greater equal unsigned bytes
G.SET.GE.U.16       Group set greater equal unsigned doublets
G.SET.GE.U.32       Group set greater equal unsigned quadlets
G.SET.GE.U.64       Group set greater equal unsigned octlets
G.SET.GE.U.128      Group set greater equal unsigned hexlet
G.SET.L.8           Group set signed less bytes
G.SET.L.16          Group set signed less doublets
G.SET.L.32          Group set signed less quadlets
G.SET.L.64          Group set signed less octlets
G.SET.L.128         Group set signed less hexlet
G.SET.L.U.8         Group set less unsigned bytes
G.SET.L.U.16        Group set less unsigned doublets
G.SET.L.U.32        Group set less unsigned quadlets
G.SET.L.U.64        Group set less unsigned octlets
G.SET.L.U.128       Group set less unsigned hexlet
G.SET.NE.8          Group set not equal bytes
G.SET.NE.16         Group set not equal doublets
G.SET.NE.32         Group set not equal quadlets
G.SET.NE.64         Group set not equal octlets
G.SET.NE.128        Group set not equal hexlet
G.SUB.8             Group subtract bytes
G.SUB.8.O           Group subtract signed bytes check overflow
G.SUB.16            Group subtract doublets
G.SUB.16.O          Group subtract signed doublets check overflow
G.SUB.32            Group subtract quadlets
G.SUB.32.O          Group subtract signed quadlets check overflow
G.SUB.64            Group subtract octlets
G.SUB.64.O          Group subtract signed octlets check overflow
G.SUB.128           Group subtract hexlet
G.SUB.128.O         Group subtract signed hexlet check overflow
G.SUB.L.8           Group subtract limit signed bytes
G.SUB.L.16          Group subtract limit signed doublets
G.SUB.L.32          Group subtract limit signed quadlets
G.SUB.L.64          Group subtract limit signed octlets
G.SUB.L.128         Group subtract limit signed hexlet
G.SUB.L.U.8         Group subtract limit unsigned bytes
G.SUB.L.U.16        Group subtract limit unsigned doublets
G.SUB.L.U.32        Group subtract limit unsigned quadlets
G.SUB.L.U.64        Group subtract limit unsigned octlets
G.SUB.L.U.128       Group subtract limit unsigned hexlet
G.SUB.U.8.O         Group subtract unsigned bytes check overflow
G.SUB.U.16.O        Group subtract unsigned doublets check overflow
G.SUB.U.32.O        Group subtract unsigned quadlets check overflow
G.SUB.U.64.O        Group subtract unsigned octlets check overflow
G.SUB.U.128.O       Group subtract unsigned hexlet check overflow
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Format 

G.op.size rd=rb,rc
rd=gopsize(rb,rc)

| G.size  |   rd   |   rc   |   rb   |   op   |
     8        6        6        6        6
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Definition 

def GroupReversed(op,size,rd,rc,rb)
    c ← RegRead(rc, 128)
    b ← RegRead(rb, 128)
    case op of
        G.SUB:
            for i ← 0 to 128-size by size
                a_{i+size-1..i} ← b_{i+size-1..i} - c_{i+size-1..i}
            endfor
        G.SUB.L:
            for i ← 0 to 128-size by size
                t ← (b_{i+size-1} || b_{i+size-1..i}) - (c_{i+size-1} || c_{i+size-1..i})
                a_{i+size-1..i} ← (t_{size} ≠ t_{size-1}) ? (t_{size} || ~t_{size}^{size-1}) : t_{size-1..0}
            endfor
        G.SUB.L.U:
            for i ← 0 to 128-size by size
                t ← (0^{1} || b_{i+size-1..i}) - (0^{1} || c_{i+size-1..i})
                a_{i+size-1..i} ← (t_{size} ≠ 0) ? (0^{size}) : t_{size-1..0}
            endfor
        G.SUB.O:
            for i ← 0 to 128-size by size
                t ← (b_{i+size-1} || b_{i+size-1..i}) - (c_{i+size-1} || c_{i+size-1..i})
                if (t_{size} ≠ t_{size-1}) then
                    raise FixedPointArithmetic
                endif
                a_{i+size-1..i} ← t_{size-1..0}
            endfor
        G.SUB.U.O:
            for i ← 0 to 128-size by size
                t ← (0^{1} || b_{i+size-1..i}) - (0^{1} || c_{i+size-1..i})
                if (t_{size} ≠ 0) then
                    raise FixedPointArithmetic
                endif
                a_{i+size-1..i} ← t_{size-1..0}
            endfor
        G.SET.E:
            for i ← 0 to 128-size by size
                a_{i+size-1..i} ← (b_{i+size-1..i} = c_{i+size-1..i})^{size}
            endfor
        G.SET.NE:
            for i ← 0 to 128-size by size
                a_{i+size-1..i} ← (b_{i+size-1..i} ≠ c_{i+size-1..i})^{size}
            endfor
        G.SET.AND.E:
            for i ← 0 to 128-size by size
                a_{i+size-1..i} ← ((b_{i+size-1..i} and c_{i+size-1..i}) = 0)^{size}
            endfor

FIG. 27C-1 






        G.SET.AND.NE:
            for i ← 0 to 128-size by size
                a_{i+size-1..i} ← ((b_{i+size-1..i} and c_{i+size-1..i}) ≠ 0)^{size}
            endfor
        G.SET.L:
            for i ← 0 to 128-size by size
                a_{i+size-1..i} ← ((rc = rb) ? (b_{i+size-1..i} < 0) : (b_{i+size-1..i} < c_{i+size-1..i}))^{size}
            endfor
        G.SET.GE:
            for i ← 0 to 128-size by size
                a_{i+size-1..i} ← ((rc = rb) ? (b_{i+size-1..i} ≥ 0) : (b_{i+size-1..i} ≥ c_{i+size-1..i}))^{size}
            endfor
        G.SET.L.U:
            for i ← 0 to 128-size by size
                a_{i+size-1..i} ← ((rc = rb) ? (b_{i+size-1..i} > 0) :
                    ((0 || b_{i+size-1..i}) < (0 || c_{i+size-1..i})))^{size}
            endfor
        G.SET.GE.U:
            for i ← 0 to 128-size by size
                a_{i+size-1..i} ← ((rc = rb) ? (b_{i+size-1..i} ≤ 0) :
                    ((0 || b_{i+size-1..i}) ≥ (0 || c_{i+size-1..i})))^{size}
            endfor
    endcase
    RegWrite(rd, 128, a)
enddef 
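
Each G.SET case compares lane b against lane c and replicates the one-bit result across the whole lane, producing an all-ones or all-zeros mask. A small Python sketch of that pattern follows; `group_set` and its `predicate` argument are illustrative assumptions, and the special handling of rc = rb in the signed and unsigned comparisons is not modeled.

```python
def group_set(c: int, b: int, size: int, predicate, width: int = 128) -> int:
    """Set every bit of a lane when predicate(b_lane, c_lane) holds, else clear it."""
    mask = (1 << size) - 1
    result = 0
    for i in range(0, width, size):
        ci = (c >> i) & mask
        bi = (b >> i) & mask
        if predicate(bi, ci):
            result |= mask << i
    return result
```

Passing `lambda bi, ci: bi == ci` models G.SET.E, and `lambda bi, ci: (bi & ci) == 0` models G.SET.AND.E.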
Operation codes

E.CON.8           Ensemble convolve signed bytes
E.CON.16          Ensemble convolve signed doublets
E.CON.32          Ensemble convolve signed quadlets
E.CON.64          Ensemble convolve signed octlets
E.CON.C.8         Ensemble convolve complex bytes
E.CON.C.16        Ensemble convolve complex doublets
E.CON.C.32        Ensemble convolve complex quadlets
E.CON.M.8         Ensemble convolve mixed-signed bytes
E.CON.M.16        Ensemble convolve mixed-signed doublets
E.CON.M.32        Ensemble convolve mixed-signed quadlets
E.CON.M.64        Ensemble convolve mixed-signed octlets
E.CON.U.8         Ensemble convolve unsigned bytes
E.CON.U.16        Ensemble convolve unsigned doublets
E.CON.U.32        Ensemble convolve unsigned quadlets
E.CON.U.64        Ensemble convolve unsigned octlets
E.DIV.64          Ensemble divide signed octlets
E.DIV.U.64        Ensemble divide unsigned octlets
E.MUL.8           Ensemble multiply signed bytes
E.MUL.16          Ensemble multiply signed doublets
E.MUL.32          Ensemble multiply signed quadlets
E.MUL.64          Ensemble multiply signed octlets
E.MUL.SUM.8       Ensemble multiply sum signed bytes
E.MUL.SUM.16      Ensemble multiply sum signed doublets
E.MUL.SUM.32      Ensemble multiply sum signed quadlets
E.MUL.SUM.64      Ensemble multiply sum signed octlets
E.MUL.C.8         Ensemble complex multiply bytes
E.MUL.C.16        Ensemble complex multiply doublets
E.MUL.C.32        Ensemble complex multiply quadlets
E.MUL.M.8         Ensemble multiply mixed-signed bytes
E.MUL.M.16        Ensemble multiply mixed-signed doublets
E.MUL.M.32        Ensemble multiply mixed-signed quadlets
E.MUL.M.64        Ensemble multiply mixed-signed octlets
E.MUL.P.8         Ensemble multiply polynomial bytes
E.MUL.P.16        Ensemble multiply polynomial doublets
E.MUL.P.32        Ensemble multiply polynomial quadlets
E.MUL.P.64        Ensemble multiply polynomial octlets
E.MUL.SUM.C.8     Ensemble multiply sum complex bytes
E.MUL.SUM.C.16    Ensemble multiply sum complex doublets
E.MUL.SUM.C.32    Ensemble multiply sum complex quadlets
E.MUL.SUM.M.8     Ensemble multiply sum mixed-signed bytes
E.MUL.SUM.M.16    Ensemble multiply sum mixed-signed doublets
E.MUL.SUM.M.32    Ensemble multiply sum mixed-signed quadlets
E.MUL.SUM.M.64    Ensemble multiply sum mixed-signed octlets
E.MUL.SUM.U.8     Ensemble multiply sum unsigned bytes
E.MUL.SUM.U.16    Ensemble multiply sum unsigned doublets
E.MUL.SUM.U.32    Ensemble multiply sum unsigned quadlets
E.MUL.SUM.U.64    Ensemble multiply sum unsigned octlets
E.MUL.U.8         Ensemble multiply unsigned bytes
E.MUL.U.16        Ensemble multiply unsigned doublets
E.MUL.U.32        Ensemble multiply unsigned quadlets
E.MUL.U.64        Ensemble multiply unsigned octlets
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Format 

E.op.size rd=rc,rb
rd=eopsize(rc,rb)

31      24 23    18 17    12 11     6 5      0
| E.size  |   rd   |   rc   |   rb   |   op   |
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Definition 

def mul(size,h,vs,v,i,ws,w,j) as
    mul ← ((vs&v_{size-1+i})^{h-size} || v_{size-1+i..i}) * ((ws&w_{size-1+j})^{h-size} || w_{size-1+j..j})
enddef

def c ← PolyMultiply(size,a,b) as
    p[0] ← 0^{2*size}
    for k ← 0 to size-1
        p[k+1] ← p[k] ^ a_{k} ? (0^{size-k} || b || 0^{k}) : 0^{2*size}
    endfor
    c ← p[size]
enddef

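PolyMultiply accumulates, by exclusive-or, a copy of b shifted left by k for every set bit k of a, which is multiplication of polynomials with coefficients modulo 2 (carry-less multiplication). A short Python sketch follows; the function name `poly_multiply` is an assumption, and the operands are masked to `size` bits to match the definition's operand width.

```python
def poly_multiply(size: int, a: int, b: int) -> int:
    """Carry-less product of two size-bit operands, returned as a 2*size-bit value."""
    mask = (1 << size) - 1
    a &= mask
    b &= mask
    product = 0
    for k in range(size):
        if (a >> k) & 1:
            product ^= b << k   # add (xor) b shifted to bit position k
    return product
```

For example, multiplying 0b11 by itself gives 0b101, since (x + 1)^2 = x^2 + 1 when coefficients are taken modulo 2.
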
def Ensemble(op,size,rd,rc,rb)
    c ← RegRead(rc, 128)
    b ← RegRead(rb, 128)
    case op of
        E.MUL:, E.MUL.C:, E.MUL.SUM, E.MUL.SUM.C, E.CON, E.CON.C, E.DIV:
            cs ← bs ← 1
        E.MUL.M:, E.MUL.SUM.M, E.CON.M:
            cs ← 0
            bs ← 1
        E.MUL.U:, E.MUL.SUM.U, E.CON.U, E.DIV.U, E.MUL.P:
            cs ← bs ← 0
    endcase
    case op of

E.MUL, E.MUL. U, E.MUL.M: 
foi i 4- 0 to 64-si2e by size 

<*2»(t+sizeH .2** «- muJ(si2e,2*si2e,cs,c J i,bs,b > i) 
endfbr 
E.MUL.P: 

for i <- 0 to 64-sizc by size 

<J2n^sizc)-l..2»i «- PoJyMuJtiply(sizc,c 5 i zc .i+i.^b S i ZC .i+i < .i) 
endfbr 
E.MUL.C: 

for i 4- 0 to 64-sizc by size 
if (i and size) * 0 then 

else 

p 4- muJ{si2c^*su^],cJLI>.rtsize) + mul(size^*size > l t CA I»size) 

                endif 
                a_{2*(i+size)-1..2*i} ← p 
            endfor 
        E.MUL.SUM, E.MUL.SUM.U, E.MUL.SUM.M: 
            p[0] ← 0^{128} 
            for i ← 0 to 128-size by size 
                p[i+size] ← p[i] + mul(size,128,cs,c,i,bs,b,i) 
            endfor 
            a ← p[128] 
        E.MUL.SUM.C: 
            p[0] ← 0^{64} 
            p[size] ← 0^{64} 
            for i ← 0 to 128-size by size 
                if (i and size) = 0 then 
                    p[i+2*size] ← p[i] + mul(size,64,1,c,i,1,b,i) 
                                       - mul(size,64,1,c,i+size,1,b,i+size) 
                else 
                    p[i+2*size] ← p[i] + mul(size,64,1,c,i,1,b,i+size) 
                                       + mul(size,64,1,c,i+size,1,b,i) 
                endif 
            endfor 
            a ← p[128+size] || p[128] 

        E.CON, E.CON.U, E.CON.M: 
            p[0] ← 0^{128} 
            for j ← 0 to 64-size by size 
                for i ← 0 to 64-size by size 
                    p[j+size]_{2*(i+size)-1..2*i} ← p[j]_{2*(i+size)-1..2*i} + 
                        mul(size,2*size,cs,c,i+64-j,bs,b,j) 
                endfor 
            endfor 
            a ← p[64] 
        E.CON.C: 
            p[0] ← 0^{128} 
            for j ← 0 to 64-size by size 
                for i ← 0 to 64-size by size 
                    if ((~i) and j and size) = 0 then 
                        p[j+size]_{2*(i+size)-1..2*i} ← p[j]_{2*(i+size)-1..2*i} + 
                            mul(size,2*size,1,c,i+64-j,1,b,j) 
                    else 
                        p[j+size]_{2*(i+size)-1..2*i} ← p[j]_{2*(i+size)-1..2*i} - 
                            mul(size,2*size,1,c,i+64-j+2*size,1,b,j) 
                    endif 
                endfor 
            endfor 
            a ← p[64] 
        E.DIV: 
            if (b = 0) or ( (c = (1 || 0^{63})) and (b = 1^{64}) ) then 
                a ← undefined 
            else 
                q ← c / b 
                r ← c - q*b 
                a ← r_{63..0} || q_{63..0} 
            endif 
        E.DIV.U: 
            if b = 0 then 
                a ← undefined 
            else 
                q ← (0 || c) / (0 || b) 
                r ← c - (0 || q)*(0 || b) 
                a ← r_{63..0} || q_{63..0} 
            endif 
    endcase 
    RegWrite(rd, 128, a) 
enddef 
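
The multiply-sum cases of the definition above reduce to a dot product over the symbols of two 128-bit operands, with the cs and bs flags choosing signed or unsigned interpretation of each operand. The sketch below illustrates this under stated assumptions: the helper names `lanes` and `e_mul_sum` are hypothetical, registers are modeled as Python integers, and only the non-complex multiply-sum is covered.

```python
def lanes(value: int, size: int, signed: bool) -> list:
    """Split a 128-bit integer into size-bit symbols, least significant first."""
    out = []
    for i in range(0, 128, size):
        x = (value >> i) & ((1 << size) - 1)
        if signed and (x >> (size - 1)) & 1:
            x -= 1 << size  # reinterpret the symbol as two's complement
        out.append(x)
    return out


def e_mul_sum(size: int, c: int, b: int, cs: bool = True, bs: bool = True) -> int:
    """Sum of symbol-wise products of c and b, truncated to 128 bits."""
    total = sum(x * y for x, y in zip(lanes(c, size, cs), lanes(b, size, bs)))
    return total & ((1 << 128) - 1)
```

Passing `cs=False, bs=False` models the unsigned form, and `cs=False, bs=True` models the mixed-signed form as the flags are assigned in the definition.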
Floating-point function Definitions 

def eb ← ebits(prec) as 
    case prec of 
        16: 
            eb ← 5 
        32: 
            eb ← 8 
        64: 
            eb ← 11 
        128: 
            eb ← 15 
    endcase 
enddef 

def eb ← ebias(prec) as 
    eb ← 0 || 1^{ebits(prec)-1} 
enddef 

def fb ← fbits(prec) as 
    fb ← prec - 1 - eb 
enddef 
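
The three helper functions above fix the field widths of each floating-point precision. A minimal Python sketch follows; it assumes that the `eb` term in fbits denotes ebits(prec), and it mirrors the pseudocode names for convenience only.

```python
EBITS = {16: 5, 32: 8, 64: 11, 128: 15}


def ebits(prec: int) -> int:
    """Number of exponent bits for the given precision."""
    return EBITS[prec]


def ebias(prec: int) -> int:
    """Exponent bias: a 0 bit followed by ebits(prec)-1 one bits."""
    return (1 << (ebits(prec) - 1)) - 1


def fbits(prec: int) -> int:
    """Number of stored fraction bits: everything except sign and exponent."""
    return prec - 1 - ebits(prec)
```

For 32-bit operands this gives 8 exponent bits, a bias of 127 and 23 fraction bits.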

def a ← F(prec, ai) as 
    a.s ← ai_{prec-1} 
    ae ← ai_{prec-2..fbits(prec)} 
    af ← ai_{fbits(prec)-1..0} 
    if ae = 1^{ebits(prec)} then 
        if af = 0 then 
            a.t ← INFINITY 
        elseif af_{fbits(prec)-1} then 
            a.t ← SNAN 
            a.e ← -fbits(prec) 
            a.f ← 1 || af_{fbits(prec)-2..0} 
        else 
            a.t ← QNAN 
            a.e ← -fbits(prec) 
            a.f ← af 
        endif 
    elseif ae = 0 then 
        if af = 0 then 
            a.t ← ZERO 
        else 
            a.t ← NORM 
            a.e ← 1-ebias(prec)-fbits(prec) 
            a.f ← 0 || af 
        endif 
    else 
        a.t ← NORM 
        a.e ← ae-ebias(prec)-fbits(prec) 
        a.f ← 1 || af 
    endif 
enddef 
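
The function F above unpacks a stored operand into a sign, a type, an exponent and an integer fraction, so that a NORM value equals the fraction times two raised to the exponent. The sketch below follows that unpacking in Python under stated assumptions: the names `unpack` and `value` are hypothetical, results are returned as plain tuples, and the exponent and fraction returned for ZERO and INFINITY are placeholder zeros because the definition leaves them unset.

```python
from fractions import Fraction

EBITS = {16: 5, 32: 8, 64: 11, 128: 15}  # exponent widths per precision


def unpack(prec: int, ai: int):
    """Return (s, t, e, f); for NORM the value is (-1)**s * f * 2**e."""
    eb = EBITS[prec]
    fb = prec - 1 - eb
    bias = (1 << (eb - 1)) - 1
    s = (ai >> (prec - 1)) & 1
    ae = (ai >> fb) & ((1 << eb) - 1)
    af = ai & ((1 << fb) - 1)
    if ae == (1 << eb) - 1:
        if af == 0:
            return (s, "INFINITY", 0, 0)
        # The definition tests the top fraction bit to select SNAN.
        kind = "SNAN" if (af >> (fb - 1)) & 1 else "QNAN"
        return (s, kind, -fb, af)
    if ae == 0:
        if af == 0:
            return (s, "ZERO", 0, 0)
        return (s, "NORM", 1 - bias - fb, af)  # denormal: hidden bit is 0
    return (s, "NORM", ae - bias - fb, (1 << fb) | af)  # hidden bit is 1


def value(u) -> Fraction:
    """Exact rational value of an unpacked NORM or ZERO operand."""
    s, t, e, f = u
    assert t in ("NORM", "ZERO")
    return (-1) ** s * Fraction(f) * Fraction(2) ** e
```

Keeping the fraction as an integer and the exponent as a separate scale is what lets the later arithmetic definitions work exactly before any rounding takes place.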

def a ← DEFAULTQNAN as 
    a.s ← 0 
    a.t ← QNAN 
    a.e ← -1 
    a.f ← 1 
enddef 

def a ← DEFAULTSNAN as 
    a.s ← 0 
    a.t ← SNAN 
    a.e ← -1 
    a.f ← 1 
enddef 

def fadd(a,b) as faddr(a,b,N) enddef 

def c ← faddr(a,b,round) as 
    if a.t=NORM and b.t=NORM then 
        // d,e are a,b with exponent aligned and fraction adjusted 
        if a.e > b.e then 
            d ← a 
            e.t ← b.t 
            e.s ← b.s 
            e.e ← a.e 
            e.f ← b.f || 0^{a.e-b.e} 
        elseif a.e < b.e then 
            d.t ← a.t 
            d.s ← a.s 
            d.e ← b.e 
            d.f ← a.f || 0^{b.e-a.e} 
            e ← b 
        endif 
        c.t ← d.t 
        c.e ← d.e 
        if d.s = e.s then 
            c.s ← d.s 
            c.f ← d.f + e.f 
        elseif d.f > e.f then 
            c.s ← d.s 
            c.f ← d.f - e.f 
        elseif d.f < e.f then 
            c.s ← e.s 
            c.f ← e.f - d.f 
        else 
            c.s ← r=F 
            c.t ← ZERO 
        endif 
    // priority is given to b operand for NaN propagation 
    elseif (b.t=SNAN) or (b.t=QNAN) then 
        c ← b 
    elseif (a.t=SNAN) or (a.t=QNAN) then 
        c ← a 
    elseif a.t=ZERO and b.t=ZERO then 
        c.t ← ZERO 
        c.s ← (a.s and b.s) or (round=F and (a.s or b.s)) 
    // NULL values are like zero, but do not combine with ZERO to alter sign 
    elseif a.t=ZERO or a.t=NULL then 
        c ← b 
    elseif b.t=ZERO or b.t=NULL then 
        c ← a 
    elseif a.t=INFINITY and b.t=INFINITY then 
        if a.s ≠ b.s then 
            c ← DEFAULTSNAN // Invalid 
        else 
            c ← a 
        endif 
    elseif a.t=INFINITY then 
        c ← a 
    elseif b.t=INFINITY then 
        c ← b 
    else 
        assert FALSE // should have covered all the cases above 
    endif 
enddef 

def b ← fneg(a) as 
    b.s ← ~a.s 
    b.t ← a.t 
    b.e ← a.e 
    b.f ← a.f 
enddef 

def fsubr(a,b,round) as faddr(a,fneg(b),round) enddef 

def frsub(a,b) as frsubr(a,b,N) enddef 

def frsubr(a,b,round) as faddr(fneg(a),b,round) enddef 

def c ← fcom(a,b) as 
    if (a.t=SNAN) or (a.t=QNAN) or (b.t=SNAN) or (b.t=QNAN) then 

    elseif a.t=INFINITY and b.t=INFINITY then 
        if a.s ≠ b.s then 
            c ← (a.s=0) ? G : L 
        else 
            c ← E 
        endif 
    elseif a.t=INFINITY then 
        c ← (a.s=0) ? G : L 
    elseif b.t=INFINITY then 
        c ← (b.s=0) ? G : L 
    elseif a.t=NORM and b.t=NORM then 
        if a.s ≠ b.s then 
            c ← (a.s=0) ? G : L 
        else 
            if a.e > b.e then 
                af ← a.f 
                bf ← b.f || 0^{a.e-b.e} 
            else 
                af ← a.f || 0^{b.e-a.e} 
                bf ← b.f 
            endif 
            if af = bf then 
                c ← E 
            else 
                c ← ((a.s=0) ^ (af > bf)) ? G : L 
            endif 
        endif 
    elseif a.t=NORM then 
        c ← (a.s=0) ? G : L 
    elseif b.t=NORM then 
        c ← (b.s=0) ? G : L 
    elseif a.t=ZERO and b.t=ZERO then 
        c ← E 
    else 
        assert FALSE // should have covered all the cases above 
    endif 
enddef 
def c ← fmul(a,b) as 
    if a.t=NORM and b.t=NORM then 
        c.s ← a.s ^ b.s 
        c.t ← NORM 
        c.e ← a.e + b.e 
        c.f ← a.f * b.f 
    // priority is given to b operand for NaN propagation 
    elseif (b.t=SNAN) or (b.t=QNAN) then 
        c.s ← a.s ^ b.s 
        c.t ← b.t 
        c.e ← b.e 
        c.f ← b.f 
    elseif (a.t=SNAN) or (a.t=QNAN) then 
        c.s ← a.s ^ b.s 
        c.t ← a.t 
        c.e ← a.e 
        c.f ← a.f 
    elseif a.t=ZERO and b.t=INFINITY then 
        c ← DEFAULTSNAN // Invalid 
    elseif a.t=INFINITY and b.t=ZERO then 
        c ← DEFAULTSNAN // Invalid 
    elseif a.t=ZERO or b.t=ZERO then 
        c.s ← a.s ^ b.s 
        c.t ← ZERO 
    else 
        assert FALSE // should have covered all the cases above 
    endif 
enddef 
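
In the NORM case of fmul above, the product is formed exactly: signs combine by exclusive-or, exponents add and the integer fractions multiply, with rounding deferred to the packing step. The following Python sketch covers only that NORM case; the `Unpacked` type and the names `fmul_norm` and `value` are assumptions introduced for illustration.

```python
from fractions import Fraction
from typing import NamedTuple


class Unpacked(NamedTuple):
    s: int  # sign bit
    t: str  # type tag, e.g. "NORM"
    e: int  # exponent (power of two scaling the fraction)
    f: int  # integer fraction


def fmul_norm(a: Unpacked, b: Unpacked) -> Unpacked:
    """Exact product of two NORM operands; no rounding is performed here."""
    assert a.t == "NORM" and b.t == "NORM"
    return Unpacked(a.s ^ b.s, "NORM", a.e + b.e, a.f * b.f)


def value(u: Unpacked) -> Fraction:
    """Exact rational value represented by an unpacked NORM operand."""
    return (-1) ** u.s * Fraction(u.f) * Fraction(2) ** u.e
```

As a usage example, multiplying the unpacked forms of 1.5 and -2.0 gives an operand whose exact value is -3, with a fraction twice as wide as either input.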

def c ← fdivr(a,b) as 
    if a.t=NORM and b.t=NORM then 
        c.s ← a.s ^ b.s 
        c.t ← NORM 
        c.e ← a.e - b.e + 256 

    // priority is given to b operand for NaN propagation 
    elseif (b.t=SNAN) or (b.t=QNAN) then 
        c.s ← a.s ^ b.s 
        c.t ← b.t 
        c.e ← b.e 
        c.f ← b.f 
    elseif (a.t=SNAN) or (a.t=QNAN) then 
        c.s ← a.s ^ b.s 
        c.t ← a.t 
        c.e ← a.e 
        c.f ← a.f 
    elseif a.t=ZERO and b.t=ZERO then 
        c ← DEFAULTSNAN // Invalid 
    elseif a.t=INFINITY and b.t=INFINITY then 
        c ← DEFAULTSNAN // Invalid 
    elseif a.t=ZERO then 
        c.s ← a.s ^ b.s 
        c.t ← ZERO 
    elseif a.t=INFINITY then 
        c.s ← a.s ^ b.s 
        c.t ← INFINITY 
    else 
        assert FALSE // should have covered all the cases above 
    endif 
enddef 

def msb ← findmsb(a) as 
    MAXF ← 2^{18} // Largest possible f value after matrix multiply 
    for j ← 0 to MAXF 
        if a_{MAXF-1..j} = (0^{MAXF-1-j} || 1) then 
            msb ← j 
        endif 
    endfor 
enddef 
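
The loop in findmsb above selects the position j at which the bits from j upward are all zero except for a one at position j, which is the index of the most significant set bit. A minimal Python sketch of the same result is given here; using `int.bit_length` and raising an error for non-positive inputs are choices made for this illustration.

```python
def findmsb(a: int) -> int:
    """Index of the most significant set bit of a positive integer."""
    if a <= 0:
        raise ValueError("findmsb requires a positive fraction")
    return a.bit_length() - 1
```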

def ai ← PackF(prec,a,round) as 
    case a.t of 
        NORM: 
            msb ← findmsb(a.f) 
            rn ← msb-1-fbits(prec) // lsb for normal 
            rdn ← -ebias(prec)-a.e-1-fbits(prec) // lsb if a denormal 
            rb ← (rn > rdn) ? rn : rdn 
            if rb ≤ 0 then 
                aifr ← a.f_{msb-1..0} || 0^{-rb} 
                eadj ← 0 
            else 
                case round of 
                    C: 
                        s ← 0^{msb-rb} || (~a.s)^{rb} 
                    F: 
                        s ← 0^{msb-rb} || (a.s)^{rb} 
                    N, NONE: 
                        s ← 0^{msb-rb} || ~a.f_{rb} || a.f_{rb}^{rb-1} 
                    X: 
                        if a.f_{rb-1..0} ≠ 0 then 
                            raise FloatingPointArithmetic // Inexact 
                        endif 
                        s ← 0 
                    Z: 
                        s ← 0 
                endcase 
                v ← (0 || a.f_{msb..0}) + (0 || s) 
                if v_{msb} = 1 then 

                    eadj ← 0 
                else 
                    aifr ← 0^{fbits(prec)} 
                    eadj ← 1 
                endif 
            endif 
            aien ← a.e + msb - 1 + eadj + ebias(prec) 
            if aien ≤ 0 then 
                if round = NONE then 
                    ai ← a.s || 0^{ebits(prec)} || aifr 
                else 
                    raise FloatingPointArithmetic // Underflow 
                endif 
            elseif aien ≥ 1^{ebits(prec)} then 
                if round = NONE then 
                    // default: round-to-nearest overflow handling 
                    ai ← a.s || 1^{ebits(prec)} || 0^{fbits(prec)} 
                else 
                    raise FloatingPointArithmetic // Underflow 
                endif 
            else 
                ai ← a.s || aien_{ebits(prec)-1..0} || aifr 
            endif 
        SNAN: 
            if round ≠ NONE then 
                raise FloatingPointArithmetic // Invalid 
            endif 
            if -a.e < fbits(prec) then 
                ai ← a.s || 1^{ebits(prec)} || a.f_{-a.e-1..0} || 0^{fbits(prec)+a.e} 
            else 
                lsb ← a.f_{-a.e-1-fbits(prec)+1..0} ≠ 0 
                ai ← a.s || 1^{ebits(prec)} || a.f_{-a.e-1..-a.e-1-fbits(prec)+2} || lsb 
            endif 
        QNAN: 
            if -a.e < fbits(prec) then 
                ai ← a.s || 1^{ebits(prec)} || a.f_{-a.e-1..0} || 0^{fbits(prec)+a.e} 
            else 
                lsb ← a.f_{-a.e-1-fbits(prec)+1..0} ≠ 0 
                ai ← a.s || 1^{ebits(prec)} || a.f_{-a.e-1..-a.e-1-fbits(prec)+2} || lsb 
            endif 
        ZERO: 
            ai ← a.s || 0^{ebits(prec)} || 0^{fbits(prec)} 
        INFINITY: 
            ai ← a.s || 1^{ebits(prec)} || 0^{fbits(prec)} 
    endcase 
enddef 

def ai ← fsinkr(prec, a, round) as 
    case a.t of 
        NORM: 
            msb ← findmsb(a.f) 
            rb ← -a.e 
            if rb ≤ 0 then 
                aifr ← a.f_{msb..0} || 0^{-rb} 
                aims ← msb - rb 
            else 
                case round of 
                    C, C.D: 
                        s ← 0^{msb-rb} || (~a.s)^{rb} 
                    F, F.D: 
                        s ← 0^{msb-rb} || (a.s)^{rb} 
                    N, NONE: 
                        s ← 0^{msb-rb} || ~a.f_{rb} || a.f_{rb}^{rb-1} 
                    X: 
                        if a.f_{rb-1..0} ≠ 0 then 
                            raise FloatingPointArithmetic // Inexact 
                        endif 
                        s ← 0 
                    Z, Z.D: 
                        s ← 0 
                endcase 
                v ← (0 || a.f_{msb..0}) + (0 || s) 
                if v_{msb} = 1 then 
                    aims ← msb + 1 - rb 
                else 
                    aims ← msb - rb 
                endif 
                aifr ← v_{aims..rb} 
            endif 
            if aims > prec then 
                case round of 
                    C.D, F.D, NONE, Z.D: 
                        ai ← a.s || (~a.s)^{prec-1} 
                    C, F, N, X, Z: 
                        raise FloatingPointArithmetic // Overflow 
                endcase 
            elseif a.s = 0 then 
                ai ← aifr 
            else 
                ai ← -aifr 
            endif 
        ZERO: 
            ai ← 0^{prec} 
        SNAN, QNAN: 
            case round of 
                C.D, F.D, NONE, Z.D: 
                    ai ← 0^{prec} 
                C, F, N, X, Z: 
                    raise FloatingPointArithmetic // Invalid 
            endcase 
        INFINITY: 
            case round of 
                C.D, F.D, NONE, Z.D: 
                    ai ← a.s || (~a.s)^{prec-1} 
                C, F, N, X, Z: 
                    raise FloatingPointArithmetic // Invalid 
            endcase 
    endcase 
enddef 



def c ← frecrest(a) as 
    b.s ← 0 
    b.t ← NORM 
    b.e ← 0 
    b.f ← 1 
    c ← fest(fdiv(b,a)) 
enddef 

def c ← frsqrest(a) as 
    b.s ← 0 
    b.t ← NORM 
    b.e ← 0 
    b.f ← 1 
    c ← fest(fsqr(fdiv(b,a))) 
enddef 

def c ← fest(a) as 
    if (a.t=NORM) then 
        msb ← findmsb(a.f) 
        a.e ← a.e + msb - 13 
        a.f ← a.f_{msb..msb-12} || 1 
    else 
        c ← a 
    endif 
enddef 

def c ← fsqr(a) as 
    if (a.t=NORM) and (a.s=0) then 
        c.s ← 0 
        c.t ← NORM 
        if (a.e_{0} = 1) then 
            c.e ← (a.e-127) / 2 
            c.f ← sqr(a.f || 0^{127}) 
        else 
            c.e ← (a.e-128) / 2 
            c.f ← sqr(a.f || 0^{128}) 
        endif 
    elseif (a.t=SNAN) or (a.t=QNAN) or a.t=ZERO or ((a.t=INFINITY) and (a.s=0)) then 
        c ← a 
    elseif ((a.t=NORM) or (a.t=INFINITY)) and (a.s=1) then 
        c ← DEFAULTSNAN // Invalid 
    else 
        assert FALSE // should have covered all the cases above 
    endif 
enddef 
Operation codes 



E.ADD.F.16 


Ensemble add floating-point half 


E.ADD.F.16.C 


Ensemble add floating-point half ceiling 


E.ADD.F.16.F 


Ensemble add floating-point half floor 


E.ADD.F.16.N 


Ensemble add floating-point half nearest 


E.ADD.F.16.X 


Ensemble add floating-point half exact 


E.ADD.F.16.Z 


Ensemble add floating-point half zero 


E.ADD.F.32 


Ensemble add floating-point single 


E.ADD.F.32.C 


Ensemble add floating-point single ceiling 


E.ADD.F.32.F 


Ensemble add floating-point single floor 


E.ADD.F.32.N 


Ensemble add floating-point single nearest 


E.ADD.F.32.X 


Ensemble add floating-point single exact 


E.ADD.F.32.Z 


Ensemble add floating-point single zero 


E.ADD.F.64 


Ensemble add floating-point double 


E.ADD.F.64.C 


Ensemble add floating-point double ceiling 


E.ADD.F.64.F 


Ensemble add floating-point double floor 


E.ADD.F.64.N 


Ensemble add floating-point double nearest 


E.ADD.F.64.X 


Ensemble add floating-point double exact 


E.ADD.F.64.Z 


Ensemble add floating-point double zero 


E.ADD.F.128 


Ensemble add floating-point quad 


E.ADD.F.128.C 


Ensemble add floating-point quad ceiling 


E.ADD.F.128.F 


Ensemble add floating-point quad floor 


E.ADD.F.128.N 


Ensemble add floating-point quad nearest 


E.ADD.F.128.X 


Ensemble add floating-point quad exact 


E.ADD.F.128.Z 


Ensemble add floating-point quad zero 


E.DIV.F.16 


Ensemble divide floating-point half 


E.DIV.F.16.C 


Ensemble divide floating-point half ceiling 


E.DIV.F.16.F 


Ensemble divide floating-point half floor 


E.DIV.F.16.N 


Ensemble divide floating-point half nearest 


E.DIV.F.16.X 


Ensemble divide floating-point half exact 


E.DIV.F.16.Z 


Ensemble divide floating-point half zero 


E.DIV.F.32 


Ensemble divide floating-point single 


E.DIV.F.32.C 


Ensemble divide floating-point single ceiling 


E.DIV.F.32.F 


Ensemble divide floating-point single floor 


E.DIV.F.32.N 


Ensemble divide floating-point single nearest 


E.DIV.F.32.X 


Ensemble divide floating-point single exact 


E.DIV.F.32.Z 


Ensemble divide floating-point single zero 


E.DIV.F.64 


Ensemble divide floating-point double 


E.DIV.F.64.C 


Ensemble divide floating-point double ceiling 


E.DIV.F.64.F 


Ensemble divide floating-point double floor 


E.DIV.F.64.N 


Ensemble divide floating-point double nearest 


E.DIV.F.64.X 


Ensemble divide floating-point double exact 


E.DIV.F.64.Z 


Ensemble divide floating-point double zero 


E.DIV.F.128 


Ensemble divide floating-point quad 


E.DIV.F.128.C 


Ensemble divide floating-point quad ceiling 


E.DIV.F.128.F 


Ensemble divide floating-point quad floor 


E.DIV.F.128.N 


Ensemble divide floating-point quad nearest 


E.DIV.F.128.X 


Ensemble divide floating-point quad exact 


E.DIV.F.128.Z 


Ensemble divide floating-point quad zero 


E.MUL.C.F.16 


Ensemble multiply complex floating-point half 


E.MUL.C.F.32 


Ensemble multiply complex floating-point single 


E.MUL.C.F.64 


Ensemble multiply complex floating-point double 


E.MUL.F.16 


Ensemble multiply floating-point half 


E.MUL.F.16.C 


Ensemble multiply floating-point half ceiling 


E.MUL.F.16.F 


Ensemble multiply floating-point half floor 


E.MUL.F.16.N 


Ensemble multiply floating-point half nearest 


E.MUL.F.16.X 


Ensemble multiply floating-point half exact 


E.MUL.F.16.Z 


Ensemble multiply floating-point half zero 


E.MUL.F.32 


Ensemble multiply floating-point single 


E.MUL.F.32.C 


Ensemble multiply floating-point single ceiling 


E.MUL.F.32.F 


Ensemble multiply floating-point single floor 


E.MUL.F.32.N 


Ensemble multiply floating-point single nearest 


E.MUL.F.32.X 


Ensemble multiply floating-point single exact 


E.MUL.F.32.Z 


Ensemble multiply floating-point single zero 


E.MUL.F.64 


Ensemble multiply floating-point double 


E.MUL.F.64.C 


Ensemble multiply floating-point double ceiling 


E.MUL.F.64.F 


Ensemble multiply floating-point double floor 


E.MUL.F.64.N 


Ensemble multiply floating-point double nearest 


E.MUL.F.64.X 


Ensemble multiply floating-point double exact 


E.MUL.F.64.Z 


Ensemble multiply floating-point double zero 


E.MUL.F.128 


Ensemble multiply floating-point quad 


E.MUL.F.128.C 


Ensemble multiply floating-point quad ceiling 


E.MUL.F.128.F 


Ensemble multiply floating-point quad floor 


E.MUL.F.128.N 


Ensemble multiply floating-point quad nearest 


E.MUL.F.128.X 


Ensemble multiply floating-point quad exact 


E.MUL.F.128.Z 


Ensemble multiply floating-point quad zero 
Selection 

class              op          prec            round/trap 
add                E.ADD.F     16 32 64 128    NONE C F N X Z 
divide             E.DIV.F     16 32 64 128    NONE C F N X Z 
multiply           E.MUL.F     16 32 64 128    NONE C F N X Z 
complex multiply   E.MUL.C.F   16 32 64        NONE 



Format 

E.op.prec.round rd=rc,rb 

rd=eopprecround(rc,rb) 

 31      24 23    18 17    12 11     6 5          0 
|  E.prec  |   rd   |   rc   |   rb   |  op.round  | 
Definition 

def mul(size,v,i,w,j) as 
    mul ← fmul(F(size,v_{size-1+i..i}),F(size,w_{size-1+j..j})) 
enddef 

def EnsembleFloatingPoint(op,prec,round,rd,rb,rc) as 
    c ← RegRead(rc, 128) 
    b ← RegRead(rb, 128) 
    for i ← 0 to 128-prec by prec 
        ci ← F(prec,c_{i+prec-1..i}) 
        bi ← F(prec,b_{i+prec-1..i}) 
        case op of 
            E.ADD.F: 
                ai ← faddr(ci,bi,round) 
            E.MUL.F: 
                ai ← fmul(ci,bi) 
            E.MUL.C.F: 
                if (i and prec) then 
                    ai ← fadd(mul(prec,c,i,b,i-prec), mul(prec,c,i-prec,b,i)) 
                else 
                    ai ← fsub(mul(prec,c,i,b,i), mul(prec,c,i+prec,b,i+prec)) 
                endif 
            E.DIV.F: 
                ai ← fdiv(ci,bi) 
        endcase 
        a_{i+prec-1..i} ← PackF(prec, ai, round) 
    endfor 
    RegWrite(rd, 128, a) 
enddef 
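
The definition above applies one floating-point operation independently to each prec-bit symbol of the two 128-bit operands and catenates the packed results. The Python sketch below shows this lane-wise pattern for four single-precision adds only. It is an illustration under stated assumptions: the name `e_add_f32` is hypothetical, registers are modeled as 16-byte little-endian strings, and the host's default round-to-nearest arithmetic stands in for the rounding control and exception behavior of the definition.

```python
import struct


def e_add_f32(c: bytes, b: bytes) -> bytes:
    """Add four packed single-precision symbols of two 128-bit operands."""
    cs = struct.unpack("<4f", c)
    bs = struct.unpack("<4f", b)
    # Each lane is computed on its own; no lane affects its neighbors.
    return struct.pack("<4f", *(x + y for x, y in zip(cs, bs)))
```

A call packs the inputs with `struct.pack("<4f", ...)` and unpacks the result the same way; sums outside the single-precision range make `struct.pack` raise `OverflowError` rather than produce an infinity.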



FIG. 30C 
Operation codes 



E.SUB.F.16 


Ensemble subtract floating-point half 


E.SUB.F.16.C 


Ensemble subtract floating-point half ceiling 


E.SUB.F.16.F 


Ensemble subtract floating-point half floor 


E.SUB.F.16.N 


Ensemble subtract floating-point half nearest 


E.SUB.F.16.Z 


Ensemble subtract floating-point half zero 


E.SUB.F.16.X 


Ensemble subtract floating-point half exact 


E.SUB.F.32 


Ensemble subtract floating-point single 


E.SUB.F.32.C 


Ensemble subtract floating-point single ceiling 


E.SUB.F.32.F 


Ensemble subtract floating-point single floor 


E.SUB.F.32.N 


Ensemble subtract floating-point single nearest 


E.SUB.F.32.Z 


Ensemble subtract floating-point single zero 


E.SUB.F.32.X 


Ensemble subtract floating-point single exact 


E.SUB.F.64 


Ensemble subtract floating-point double 


E.SUB.F.64.C 


Ensemble subtract floating-point double ceiling 


E.SUB.F.64.F 


Ensemble subtract floating-point double floor 


E.SUB.F.64.N 


Ensemble subtract floating-point double nearest 


E.SUB.F.64.Z 


Ensemble subtract floating-point double zero 


E.SUB.F.64.X 


Ensemble subtract floating-point double exact 


E.SUB.F.128 


Ensemble subtract floating-point quad 


E.SUB.F.128.C 


Ensemble subtract floating-point quad ceiling 


E.SUB.F.128.F 


Ensemble subtract floating-point quad floor 


E.SUB.F.128.N 


Ensemble subtract floating-point quad nearest 


E.SUB.F.128.Z 


Ensemble subtract floating-point quad zero 


E.SUB.F.128.X 


Ensemble subtract floating-point quad exact 
Selection 

class      op                 prec            round/trap 
set        SET. E LG L GE     16 32 64 128    NONE X 
subtract   SUB                16 32 64 128    NONE C F N X Z 



Format 

E.op.prec.round rd=rb,rc 
rd=eopprecround(rb,rc) 

 31      24 23    18 17    12 11     6 5          0 
|  E.prec  |   rd   |   rc   |   rb   |  op.round  | 
     8         6        6        6          6 
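
The format above divides a 32-bit instruction word into one 8-bit field followed by four 6-bit fields. The sketch below packs and unpacks those fields in Python. The function names and the generic field names `major` and `minor` are assumptions for illustration, and the numeric opcode values used in any call are hypothetical because the encoding values are not given here.

```python
FIELD_WIDTHS = (8, 6, 6, 6, 6)  # major, rd, rc, rb, minor; most significant first


def encode(major: int, rd: int, rc: int, rb: int, minor: int) -> int:
    """Pack the five fields into a 32-bit instruction word."""
    word = 0
    for fld, width in zip((major, rd, rc, rb, minor), FIELD_WIDTHS):
        if not 0 <= fld < (1 << width):
            raise ValueError("field value does not fit its width")
        word = (word << width) | fld
    return word


def decode(word: int) -> tuple:
    """Split a 32-bit instruction word back into its five fields."""
    fields = []
    for width in reversed(FIELD_WIDTHS):
        fields.append(word & ((1 << width) - 1))
        word >>= width
    return tuple(reversed(fields))
```

Decoding an encoded word returns the original five field values, which is a convenient self-check when experimenting with the layout.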
Definition 

def EnsembleReversedFloatingPoint(op,prec,round,rd,rc,rb) as 
    c ← RegRead(rc, 128) 
    b ← RegRead(rb, 128) 
    for i ← 0 to 128-prec by prec 
        ci ← F(prec,c_{i+prec-1..i}) 
        bi ← F(prec,b_{i+prec-1..i}) 
        ai ← frsubr(ci,-bi, round) 
        a_{i+prec-1..i} ← PackF(prec, ai, round) 
    endfor 
    RegWrite(rd, 128, a) 
enddef 
Operation codes 



X.COMPRESS.2 


Crossbar compress signed pecks 


X.COMPRESS.4 


Crossbar compress signed nibbles 


X.COMPRESS.8 


Crossbar compress signed bytes 


X.COMPRESS.16 


Crossbar compress signed doublets 


X.COMPRESS.32 


Crossbar compress signed quadlets 


X.COMPRESS.64 


Crossbar compress signed octlets 


X.COMPRESS.128 


Crossbar compress signed hexlet 


X.COMPRESS.U.2 


Crossbar compress unsigned pecks 


X.COMPRESS.U.4 


Crossbar compress unsigned nibbles 


X.COMPRESS.U.8 


Crossbar compress unsigned bytes 


X.COMPRESS.U.16 


Crossbar compress unsigned doublets 


X.COMPRESS.U.32 


Crossbar compress unsigned quadlets 


X.COMPRESS.U.64 


Crossbar compress unsigned octlets 


X.COMPRESS.U.128 


Crossbar compress unsigned hexlet 


X.EXPAND.2 


Crossbar expand signed pecks 


X.EXPAND.4 


Crossbar expand signed nibbles 


X.EXPAND.8 


Crossbar expand signed bytes 


X.EXPAND.16 


Crossbar expand signed doublets 


X.EXPAND.32 


Crossbar expand signed quadlets 


X.EXPAND.64 


Crossbar expand signed octlets 


X.EXPAND.128 


Crossbar expand signed hexlet 


X.EXPAND.U.2 


Crossbar expand unsigned pecks 


X.EXPAND.U.4 


Crossbar expand unsigned nibbles 


X.EXPAND.U.8 


Crossbar expand unsigned bytes 


X.EXPAND.U.16 


Crossbar expand unsigned doublets 


X.EXPAND.U.32 


Crossbar expand unsigned quadlets 


X.EXPAND.U.64 


Crossbar expand unsigned octlets 


X.EXPAND.U.128 


Crossbar expand unsigned hexlet 


X.ROTL.2 


Crossbar rotate left pecks 


X.ROTL.4 


Crossbar rotate left nibbles 


X.ROTL.8 


Crossbar rotate left bytes 


X.ROTL.16 


Crossbar rotate left doublets 


X.ROTL.32 


Crossbar rotate left quadlets 


X.ROTL.64 


Crossbar rotate left octlets 


X.ROTL.128 


Crossbar rotate left hexlet 


X.ROTR.2 


Crossbar rotate right pecks 


X.ROTR.4 


Crossbar rotate right nibbles 


X.ROTR.8 


Crossbar rotate right bytes 


X.ROTR.16 


Crossbar rotate right doublets 



FIG. 32A-1 
X.ROTR.32 


Crossbar rotate right quadlets 


X.ROTR.64 


Crossbar rotate right octlets 


X.ROTR.128 


Crossbar rotate right hexlet 


X.SHL.2 


Crossbar shift left pecks 


X.SHL.2.O 


Crossbar shift left signed pecks check overflow 


X.SHL.4 


Crossbar shift left nibbles 


X.SHL.4.O 


Crossbar shift left signed nibbles check overflow 


X.SHL.8 


Crossbar shift left bytes 


X.SHL.8.O 


Crossbar shift left signed bytes check overflow 


X.SHL.16 


Crossbar shift left doublets 


X.SHL.16.O 


Crossbar shift left signed doublets check overflow 


X.SHL.32 


Crossbar shift left quadlets 


X.SHL.32.O 


Crossbar shift left signed quadlets check overflow 


X.SHL.64 


Crossbar shift left octlets 


X.SHL.64.O 


Crossbar shift left signed octlets check overflow 


X.SHL.128 


Crossbar shift left hexlet 


X.SHL.128.O 


Crossbar shift left signed hexlet check overflow 


X.SHL.U.2.O 


Crossbar shift left unsigned pecks check overflow 


X.SHL.U.4.O 


Crossbar shift left unsigned nibbles check overflow 


X.SHL.U.8.O 


Crossbar shift left unsigned bytes check overflow 


X.SHL.U.16.O 


Crossbar shift left unsigned doublets check overflow 


X.SHL.U.32.O 


Crossbar shift left unsigned quadlets check overflow 


X.SHL.U.64.O 


Crossbar shift left unsigned octlets check overflow 


X.SHL.U.128.O 


Crossbar shift left unsigned hexlet check overflow 


X.SHR.2 


Crossbar signed shift right pecks 


X.SHR.4 


Crossbar signed shift right nibbles 


X.SHR.8 


Crossbar signed shift right bytes 


X.SHR.16 


Crossbar signed shift right doublets 


X.SHR.32 


Crossbar signed shift right quadlets 


X.SHR.64 


Crossbar signed shift right octlets 


X.SHR.128 


Crossbar signed shift right hexlet 


X.SHR.U.2 


Crossbar shift right unsigned pecks 


X.SHR.U.4 


Crossbar shift right unsigned nibbles 


X.SHR.U.8 


Crossbar shift right unsigned bytes 


X.SHR.U.16 


Crossbar shift right unsigned doublets 


X.SHR.U.32 


Crossbar shift right unsigned quadlets 


X.SHR.U.64 


Crossbar shift right unsigned octlets 


X.SHR.U.128 


Crossbar shift right unsigned hexlet 



FIG. 32A-2 
Selection 

class       op                                       size 
precision   EXPAND EXPAND.U COMPRESS COMPRESS.U      2 4 8 16 32 64 128 
shift       ROTR ROTL SHR SHL SHL.O SHL.U.O SHR.U    2 4 8 16 32 64 128 



Format 

X.op.size rd=rc,rb 
rd=xopsize(rc,rb) 

 31      25 24 23    18 17    12 11     6 5    2 1   0 
|  X.SHIFT | s |   rd   |   rc   |   rb   |  op  | sz  | 
     7       1     6        6        6       4     2 

lsize ← log(size) 



FIG. 32B 
Definition 

def Crossbar(op,size,rd,rc,rb) 
    c ← RegRead(rc, 128) 
    b ← RegRead(rb, 128) 
    shift ← b and (size-1) 
    case op_{5..2} || 0^{2} of 
        X.COMPRESS: 
            hsize ← size/2 
            for i ← 0 to 64-hsize by hsize 
                if shift ≤ hsize then 
                    a_{i+hsize-1..i} ← c_{i+i+shift+hsize-1..i+i+shift} 
                else 
                    a_{i+hsize-1..i} ← c_{i+i+size-1}^{shift-hsize} || c_{i+i+size-1..i+i+shift} 
                endif 
            endfor 
            a_{127..64} ← 0 
        X.COMPRESS.U: 
            hsize ← size/2 
            for i ← 0 to 64-hsize by hsize 
                if shift ≤ hsize then 
                    a_{i+hsize-1..i} ← c_{i+i+shift+hsize-1..i+i+shift} 
                else 
                    a_{i+hsize-1..i} ← 0^{shift-hsize} || c_{i+i+size-1..i+i+shift} 
                endif 
            endfor 
            a_{127..64} ← 0 
        X.EXPAND: 
            hsize ← size/2 
            for i ← 0 to 64-hsize by hsize 
                if shift ≤ hsize then 
                    a_{i+i+size-1..i+i} ← c_{i+hsize-1}^{hsize-shift} || c_{i+hsize-1..i} || 0^{shift} 
                else 
                    a_{i+i+size-1..i+i} ← c_{i+size-shift-1..i} || 0^{shift} 
                endif 
            endfor 



FIG. 32C-1 
        X.EXPAND.U: 
            hsize ← size/2 
            for i ← 0 to 64-hsize by hsize 
                if shift ≤ hsize then 
                    a_{i+i+size-1..i+i} ← 0^{hsize-shift} || c_{i+hsize-1..i} || 0^{shift} 
                else 
                    a_{i+i+size-1..i+i} ← c_{i+size-shift-1..i} || 0^{shift} 
                endif 
            endfor 
        X.ROTL: 
            for i ← 0 to 128-size by size 
                a_{i+size-1..i} ← c_{i+size-1-shift..i} || c_{i+size-1..i+size-1-shift} 
            endfor 
        X.ROTR: 
            for i ← 0 to 128-size by size 
                a_{i+size-1..i} ← c_{i+shift-1..i} || c_{i+size-1..i+shift} 
            endfor 
        X.SHL: 
            for i ← 0 to 128-size by size 
                a_{i+size-1..i} ← c_{i+size-1-shift..i} || 0^{shift} 
            endfor 
        X.SHL.O: 
            for i ← 0 to 128-size by size 
                if c_{i+size-1..i+size-1-shift} ≠ c_{i+size-1-shift}^{shift+1} then 
                    raise FixedPointArithmetic 
                endif 
                a_{i+size-1..i} ← c_{i+size-1-shift..i} || 0^{shift} 
            endfor 



FIG. 32C-2 
        X.SHL.U.O: 
            for i ← 0 to 128-size by size 
                if c_{i+size-1..i+size-shift} ≠ 0^{shift} then 
                    raise FixedPointArithmetic 
                endif 
                a_{i+size-1..i} ← c_{i+size-1-shift..i} || 0^{shift} 
            endfor 
        X.SHR: 
            for i ← 0 to 128-size by size 
                a_{i+size-1..i} ← c_{i+size-1}^{shift} || c_{i+size-1..i+shift} 
            endfor 
        X.SHR.U: 
            for i ← 0 to 128-size by size 
                a_{i+size-1..i} ← 0^{shift} || c_{i+size-1..i+shift} 
            endfor 
    endcase 
    RegWrite(rd, 128, a) 
enddef 
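
The shift and rotate cases of the crossbar definition above operate on each size-bit symbol of the 128-bit operand separately, so that no bits cross a symbol boundary. The Python sketch below illustrates three of the cases. The helper names are assumptions, the register is modeled as a Python integer, and the shift amount is reduced with `size-1` in the same way as in the definition.

```python
def for_each_lane(c: int, size: int, fn) -> int:
    """Apply fn to every size-bit symbol of a 128-bit value and reassemble."""
    lane_mask = (1 << size) - 1
    a = 0
    for i in range(0, 128, size):
        a |= (fn((c >> i) & lane_mask) & lane_mask) << i
    return a


def x_shr_u(c: int, size: int, shift: int) -> int:
    """Unsigned shift right: zeros enter at the top of each symbol."""
    shift &= size - 1
    return for_each_lane(c, size, lambda x: x >> shift)


def x_shr(c: int, size: int, shift: int) -> int:
    """Signed shift right: the sign bit of each symbol is replicated."""
    shift &= size - 1

    def lane(x: int) -> int:
        if (x >> (size - 1)) & 1:
            x -= 1 << size  # reinterpret as two's complement
        return x >> shift   # Python's >> is arithmetic for negative ints

    return for_each_lane(c, size, lane)


def x_rotr(c: int, size: int, shift: int) -> int:
    """Rotate right within each symbol."""
    shift &= size - 1
    return for_each_lane(c, size, lambda x: (x >> shift) | (x << (size - shift)))
```

With byte symbols and a shift of 4, the symbol 0x80 becomes 0x08 under the unsigned shift and 0xF8 under the signed shift, while the neighboring symbols are unaffected.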



FIG. 32C -3 
Compress 32 bits to 16, with 4-bit right shift 



FIG. 32D 
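
The compress operation illustrated in this figure takes each 32-bit symbol, shifts it right by 4 bits and keeps the low 16 bits, packing the four results into the low half of the register. The following Python sketch covers the case in which the shift amount does not exceed half the symbol size; the function name `x_compress` is hypothetical and the sign-fill case for larger shifts is deliberately omitted.

```python
def x_compress(c: int, size: int, shift: int) -> int:
    """Compress size-bit symbols of a 128-bit value to size/2 bits each."""
    hsize = size // 2
    shift &= size - 1
    if shift > hsize:
        raise NotImplementedError("fill case for shift > size/2 is not sketched")
    a = 0
    for i in range(0, 64, hsize):
        # Result symbol at bit i comes from source bits starting at 2*i + shift.
        a |= ((c >> (2 * i + shift)) & ((1 << hsize) - 1)) << i
    return a  # upper 64 bits of the result are zero
```

For the figure's parameters, the symbol 0x12345678 compresses to 0x4567.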
Format 

X.EXTRACT ra=rd,rc,rb 

ra=xextract(rd,rc,rb) 

 31      24 23    18 17    12 11     6 5      0 
|    op    |   rd   |   rc   |   rb   |   ra   | 
     8         6        6        6        6 



FIG. 33A 
Definition 

def CrossbarExtract(op,ra,rb,rc,rd) as 
    d ← RegRead(rd, 128) 
    c ← RegRead(rc, 128) 
    b ← RegRead(rb, 128) 
    case b_{8..0} of 
        0..255: 
            gsize ← 128 
        256..383: 
            gsize ← 64 
        384..447: 
            gsize ← 32 
        448..479: 
            gsize ← 16 
        480..495: 
            gsize ← 8 
        496..503: 
            gsize ← 4 
        504..507: 
            gsize ← 2 
        508..511: 
            gsize ← 1 
    endcase 
    m ← b_{12} 
    as ← signed ← b_{14} 
    h ← (2-m)*gsize 
    spos ← (b_{8..0}) and ((2-m)*gsize-1) 
    dpos ← (0 || b_{23..16}) and (gsize-1) 
    sfsize ← (0 || b_{31..24}) and (gsize-1) 
    tfsize ← (sfsize = 0) or ((sfsize+dpos) > gsize) ? gsize-dpos : sfsize 
    fsize ← (tfsize + spos > h) ? h - spos : tfsize 
    for i ← 0 to 128-gsize by gsize 
        case op of 
            X.EXTRACT: 
                if m then 
                    p ← d_{gsize+i-1..i} 
                else 
                    p ← (d || c)_{2*(gsize+i)-1..2*i} 
                endif 
        endcase 
        v ← (as & p_{h-1}) || p 
        w ← (as & v_{spos+fsize-1})^{gsize-fsize-dpos} || v_{fsize-1+spos..spos} || 0^{dpos} 
        if m then 
            a_{gsize-1+i..i} ← c_{gsize-1+i..dpos+fsize+i} || w_{dpos+fsize-1..dpos} || c_{dpos-1+i..i} 
        else 
            a_{gsize-1+i..i} ← w 
        endif 
    endfor 
    RegWrite(ra, 128, a) 
enddef 
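
The extract definition above reads its control fields from register rb: the low nine bits select the group size through a range table and also supply the source position, while higher bits hold the merge flag, the signed flag, the destination position and the field size. The Python sketch below decodes those fields. The function names and the dictionary layout are assumptions for illustration; the ranges and bit positions follow the definition.

```python
import bisect

_STARTS = [0, 256, 384, 448, 480, 496, 504, 508]  # first value of each range
_GSIZES = [128, 64, 32, 16, 8, 4, 2, 1]           # group size for that range


def decode_gsize(b: int) -> int:
    """Group size selected by the low nine bits of the control operand."""
    v = b & 0x1FF
    return _GSIZES[bisect.bisect_right(_STARTS, v) - 1]


def decode_fields(b: int) -> dict:
    """Decode the extract control fields from the control operand."""
    gsize = decode_gsize(b)
    m = (b >> 12) & 1
    return {
        "gsize": gsize,
        "m": m,
        "signed": (b >> 14) & 1,
        "spos": (b & 0x1FF) & ((2 - m) * gsize - 1),
        "dpos": ((b >> 16) & 0xFF) & (gsize - 1),
        "sfsize": ((b >> 24) & 0xFF) & (gsize - 1),
    }
```

Each range in the table starts at 512 minus four times the group size, which is why one nine-bit field can carry both the group size and a source position.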



FIG. 33B 
[Figure: source operands rd and rc||rb of width 2*gsize, with a field of fsize bits at position spos; the result holds the field at position dpos with sign fill s above and zero fill 0 below.] 

Crossbar extract 

FIG. 33C 



[Figure: source operands rd, rc and rb of width gsize, with a field of fsize bits at position spos merged into the result.] 

Crossbar merge extract 



FIG. 33D 
X.SHUFFLE.4 


Crossbar shuffle within pecks 


X.SHUFFLE.8 


Crossbar shuffle within bytes 


X.SHUFFLE.16 


Crossbar shuffle within doublets 


X.SHUFFLE.32 


Crossbar shuffle within quadlets 


X.SHUFFLE.64 


Crossbar shuffle within octlets 


X.SHUFFLE.128 


Crossbar shuffle within hexlet 


X.SHUFFLE.256 


Crossbar shuffle within triclet 



FIG. 34A 
Format 

X.SHUFFLE.256 rd=rc,rb,v,w,h 
X.SHUFFLE.size rd=rcb,v,w 

rd=xshuffle256(rc,rb,v,w,h) 
rd=xshufflesize(rcb,v,w) 

 31        24 23    18 17    12 11     6 5      0 
|  X.SHUFFLE |   rd   |   rc   |   rb   |   op   | 

rc ← rb ← rcb 
x ← log2(size) 
y ← log2(v) 
z ← log2(w) 

op ← ((x*x*x-3*x*x-4*x)/6-(z*z-z)/2+x*z+y) + (size=256)*(h*32-56) 
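
The op formula above assigns a distinct code to each legal combination of size, v and w. The Python sketch below evaluates the formula so that this property can be checked by enumeration. The function name `shuffle_op` is hypothetical, and the sketch assumes that size, v and w are powers of two.

```python
def shuffle_op(size: int, v: int, w: int, h: int = 0) -> int:
    """Evaluate the op encoding for a shuffle with the given parameters."""
    x = size.bit_length() - 1  # log2(size)
    y = v.bit_length() - 1     # log2(v)
    z = w.bit_length() - 1     # log2(w)
    # Both divisions are exact for every legal x and z.
    op = (x * x * x - 3 * x * x - 4 * x) // 6 - (z * z - z) // 2 + x * z + y
    if size == 256:
        op += h * 32 - 56
    return op
```

Enumerating x from 2 to 7 with the y and z ranges used in the definition gives each value from 0 to 55 exactly once, and the 256-bit form gives 0 to 27 for h equal to 0.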



FIG. 34B 
Definition 

def CrossbarShuffle(major,rd,rc,rb,op) 
    c ← RegRead(rc, 128) 
    b ← RegRead(rb, 128) 
    if rc=rb then 
        case op of 
            0..55: 
                for x ← 2 to 7; for y ← 0 to x-2; for z ← 1 to x-y-1 
                    if op = ((x*x*x-3*x*x-4*x)/6-(z*z-z)/2+x*z+y) then 
                        for i ← 0 to 127 
                            a_{i} ← c_{(i_{6..x} || i_{y+z-1..y} || i_{x-1..y+z} || i_{y-1..0})} 
                        end 
                    endif 
                endfor; endfor; endfor 
            56..63: 
                raise ReservedInstruction 
        endcase 
    elseif 
        case op_{4..0} of 
            0..27: 
                cb ← c || b 

                h ← op_{5} 
                for y ← 0 to x-2; for z ← 1 to x-y-1 
                    if op_{4..0} = ((17*z-z*z)/2-8+y) then 
                        for i ← h*128 to 127+h*128 
                            a_{i-h*128} ← cb_{(i_{y+z-1..y} || i_{x-1..y+z} || i_{y-1..0})} 
                        end 
                    endif 
                endfor; endfor 
            28..31: 
                raise ReservedInstruction 
        endcase 
    endif 
    RegWrite(rd, 128, a) 
enddef 
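
The shuffle definition above forms each result bit by permuting the bits of its index: within the x-bit index field, the z bits starting at position y are moved to the top of the field, and the bits above and below the field are left alone. The following Python sketch computes that source index; the function name `shuffle_index` is an assumption, and only the single-operand form (indices 0 to 127) is modeled.

```python
def shuffle_index(i: int, x: int, y: int, z: int) -> int:
    """Source bit index for result bit i under the shuffle (x, y, z)."""
    high = i >> x                                       # i_{6..x}
    moved = (i >> y) & ((1 << z) - 1)                   # i_{y+z-1..y}
    kept = (i >> (y + z)) & ((1 << (x - y - z)) - 1)    # i_{x-1..y+z}
    low = i & ((1 << y) - 1)                            # i_{y-1..0}
    # Catenate: high || moved || kept || low
    return (high << x) | (moved << (x - z)) | (kept << y) | low
```

For x=2, y=0 and z=1 the mapping exchanges the two middle bits of each 4-bit group, which is the smallest perfect shuffle.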



FIG. 34C 
SYSTEM WITH WIDE OPERAND 
ARCHITECTURE, AND METHOD 

RELATED APPLICATIONS 

This application is a continuation of U.S. patent 
application Ser. No. 09/382,402, filed Aug. 24, 1999, now U.S. 
Pat. No. 6,295,599, which claims the benefit of priority to 
Provisional Application No. 60/097,635 filed on Aug. 24, 
1998, and which is a continuation-in-part of U.S. patent 
application Ser. No. 09/169,963, filed Oct. 13, 1998, now 
U.S. Pat. No. 6,006,318, which is a continuation of U.S. 
patent application Ser. No. 08/754,827, filed Nov. 22, 1996 
now U.S. Pat. No. 5,822,603, which is a divisional of U.S. 
patent application Ser. No. 08/516,036, filed Aug. 16, 1995 
now U.S. Pat. No. 5,742,840. 

MICROFICHE APPENDIX 

Included herewith as an Appendix are 5 sheets of 
microfiche of 63 frames each. 

FIELD OF THE INVENTION 

The present invention relates to general purpose processor 
architectures, and particularly relates to wide operand archi- 
tectures. 

REFERENCE TO A "SEQUENCE LISTING," A 
TABLE, OR A COMPUTER PROGRAM LISTING 
APPENDIX SUBMITTED ON A COMPACT 
DISK 

This application includes an appendix, submitted 
herewith in duplicate on compact disks labeled as "Copy 1" and 
"Copy 2." The contents of the compact disks are hereby 
incorporated by reference. 

BACKGROUND OF THE INVENTION 

The performance level of a processor, and particularly a 
general purpose processor, can be estimated from the 
multiple of a plurality of interdependent factors: clock rate, gates 
per clock, number of operands, operand and data path width, 
and operand and data path partitioning. Clock rate is largely 
influenced by the choice of circuit and logic technology, but 
is also influenced by the number of gates per clock. Gates 
per clock is how many gates in a pipeline may change state 
in a single clock cycle. This can be reduced by inserting 
latches into the data path: when the number of gates between 
latches is reduced, a higher clock is possible. However, the 
additional latches produce a longer pipeline length, and thus 
come at a cost of increased instruction latency. The number 
of operands is straightforward; for example, by adding with 
carry-save techniques, three values may be added together 
with little more delay than is required for adding two values. 
Operand and data path width defines how much data can be 
processed at once; wider data paths can perform more 
complex functions, but generally this comes at a higher 
implementation cost. Operand and data path partitioning 
refers to the efficient use of the data path as width is 
increased, with the objective of maintaining substantially 
peak usage. 
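
The remark above about carry-save techniques can be illustrated with a short sketch. A carry-save stage reduces three addends to two values, a bitwise sum and a carry word, without propagating carries across bit positions, so only one ordinary addition remains. The Python function below is a minimal illustration; the name `carry_save_add` and the default width of 64 bits are assumptions.

```python
def carry_save_add(a: int, b: int, c: int, width: int = 64):
    """Reduce three addends to a (sum, carry) pair with no carry propagation."""
    mask = (1 << width) - 1
    s = (a ^ b ^ c) & mask                                  # per-bit sum
    carry = (((a & b) | (a & c) | (b & c)) << 1) & mask     # per-bit majority
    return s, carry
```

Adding the two returned words with a single conventional adder gives the sum of the three inputs modulo two raised to the width.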

The last factor, operand and data path partitioning, is 
treated extensively in commonly-assigned U.S. Pat. Nos. 
5,742,840, 5,794,060, 5,794,061, 5,809,321, and 5,822,603, 
which describe systems and methods for enhancing the 
utilization of a general purpose processor by adding classes 
of instructions. These classes of instructions use the contents 
of general purpose registers as data path sources, partition 
the operands into symbols of a specified size, perform 
operations in parallel, catenate the results and place the 
catenated results into a general-purpose register. These 
patents, all of which are assigned to the same assignee as the 
present invention, teach a general purpose microprocessor 
which has been optimized for processing and transmitting 
media data streams through significant parallelism. 
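
The partitioned style of instruction described above can be sketched briefly: a register is treated as a row of equal-sized symbols, the same operation is applied to every symbol, and the results are catenated. The Python function below shows this for addition. The name `group_add` and the use of Python integers as 128-bit registers are assumptions made for illustration.

```python
def group_add(a: int, b: int, size: int, width: int = 128) -> int:
    """Add two registers symbol by symbol; carries do not cross symbols."""
    lane_mask = (1 << size) - 1
    out = 0
    for i in range(0, width, size):
        lane = (((a >> i) & lane_mask) + ((b >> i) & lane_mask)) & lane_mask
        out |= lane << i  # catenate the per-symbol results
    return out
```

With byte symbols, adding 1 to 0xFF wraps to 0 inside that byte, whereas with doublet symbols the same inputs give 0x0100.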

While the foregoing patents offered significant 
improvements in utilization and performance of a general purpose 
microprocessor, particularly for handling broadband 
communications such as media data streams, other 
improvements are possible. 

Many general purpose processors have general registers 
to store operands for instructions, with the register width 
matched to the size of the data path. Processor designs 
generally limit the number of accessible registers per 
instruction because the hardware to access these registers is 
relatively expensive in power and area. While the number of 
accessible registers varies among processor designs, it is 
often limited to two, three or four registers per instruction 
when such instructions are designed to operate in a single 
processor clock cycle or a single pipeline flow. Some 
processors, such as the Motorola 68000, have instructions to 
save and restore an unlimited number of registers, but 
require multiple cycles to perform such an instruction. 

1 "he Motorola 68000 also attempts to overcome a narrow 
data path combined with a narrow register file by taking 
multiple cycles or pipeline flows to perform an instruction, 
and thus emulating a wider data path. However, such mul- 
tiple precision techniques offer only marginal improvement 
in view of the additional clock cycles required. The width 
and accessible number of the general purpose registers thus 
fundamentally limits the amount of processing that can be 
performed by a single instruction in a register-based 
machine. 

Existing processors may provide instructions that accept
operands for which one or more operands are read from a
general purpose processor's memory system. However, as
these memory operands are generally specified by register
operands, and the memory system data path is no wider than
the processor data path, the width and accessible number of
general purpose operands per instruction per cycle or
pipeline flow is not enhanced.

The number of general purpose register operands accessible
per instruction is generally limited by logical complexity
and instruction size. For example, it might be possible to
implement certain desirable but complex functions by
specifying a large number of general purpose registers, but
substantial additional logic would have to be added to a
conventional design to permit simultaneous reading and
bypassing of the register values. While dedicated registers
have been used in some prior art designs to increase the
number or size of source operands or results, explicit
instructions load or store values into these dedicated
registers, and additional instructions are required to save and
restore these registers upon a change of processor context.

There has therefore been a need for a processor system
capable of efficient handling of operands of greater width
than either the memory system or any accessible general
purpose register.

SUMMARY OF THE INVENTION 

The present invention provides a system and method for
improving the performance of general purpose processors by
expanding at least one source operand to a width greater than
the width of either the general purpose register or the data
path width. In addition, several classes of instructions will
be provided which cannot be performed efficiently if the
operands are limited to the width and accessible number of
general purpose registers.

In the present invention, operands are provided which are
substantially larger than the data path width of the processor.
This is achieved, in part, by using a general purpose register
to specify a memory address from which at least more than
one, but typically several data path widths of data can be
read. To permit such a wide operand to be performed in a
single cycle, the data path functional unit is augmented with
dedicated storage to which the memory operand is copied on
an initial execution of the instruction. Further execution of
the instruction or other similar instructions that specify the
same memory address can read the dedicated storage to
obtain the operand value. However, such reads are subject to
conditions to verify that the memory operand has not been
altered by intervening instructions. If the memory operand
remains current (that is, the conditions are met), the
memory operand fetch can be combined with one or more
register operands in the functional unit, producing a result.
The size of the result is, typically, constrained to that of a
general register so that no dedicated or other special storage
is required for the result.

Exemplary instructions using wide operations include
wide instructions that perform bit level switching (Wide
Switch), byte or larger table-lookup (Wide Translate), Wide
Multiply Matrix, Wide Multiply Matrix Extract, Wide
Multiply Matrix Extract Immediate, Wide Multiply Matrix
Floating point, and Wide Multiply Matrix Galois.

Another aspect of the present invention addresses efficient
usage of a multiplier array that is fully used for high
precision arithmetic, but is only partly used for other, lower
precision operations. This can be accomplished by extracting
the high-order portion of the multiplier product or sum
of products, adjusted by a dynamic shift amount from a
general register or an adjustment specified as part of the
instruction, and rounded by a control value from a register
or instruction portion. The rounding may be any of several
types, including round-to-nearest/even, toward zero, floor,
or ceiling. Overflows are typically handled by limiting the
result to the largest and smallest values that can be
accurately represented in the output result.

When an extract is controlled by a register, the size of the
result can be specified, allowing rounding and limiting to a
smaller number of bits than can fit in the result. This permits
the result to be scaled for use in subsequent operations
without concern of overflow or rounding. As a result,
performance is enhanced. In those instances where the
extract is controlled by a register, a single register value
defines the size of the operands, the shift amount and size of
the result, and the rounding control. By placing such control
information in a single register, the size of the instruction is
reduced over the number of bits that such an instruction
would otherwise require, again improving performance and
enhancing processor flexibility. Exemplary instructions are
Ensemble Convolve Extract, Ensemble Multiply Extract,
Ensemble Multiply Add Extract, and Ensemble Scale Add
Extract. With particular regard to the Ensemble Scale Add
Extract Instruction, the extract control information is
combined in a register with two values used as scalar multipliers
to the contents of two vector multiplicands. This combination
reduces the number of registers otherwise required, thus
reducing the number of bits required for the instruction.

THE FIGURES

FIG. 1 is a system level diagram showing the functional
blocks of a system in accordance with an exemplary
embodiment of the present invention.

FIG. 2 is a matrix representation of a wide matrix multiply
in accordance with an exemplary embodiment of the present
invention.

FIG. 3 is a further representation of a wide matrix
multiply in accordance with an exemplary embodiment of
the present invention.

FIG. 4 is a system level diagram showing the functional
blocks of a system incorporating a combined Simultaneous
Multi Threading and Decoupled Access from Execution
processor in accordance with an exemplary embodiment of
the present invention.

FIG. 5 illustrates a wide operand in accordance with an
exemplary embodiment of the present invention.

FIG. 6 illustrates an approach to specifier decoding in
accordance with an exemplary embodiment of the present
invention.

FIG. 7 illustrates in operational block form a Wide
Function Unit in accordance with an exemplary embodiment
of the present invention.

FIG. 8 illustrates in flow diagram form the Wide Microcache
control function in accordance with an exemplary
embodiment of the present invention.

FIG. 9 illustrates Wide Microcache data structures in
accordance with an exemplary embodiment of the present
invention.

FIGS. 10 and 11 illustrate a Wide Microcache control in
accordance with an exemplary embodiment of the present
invention.

FIGS. 12A-12D illustrate a Wide Switch instruction in
accordance with an exemplary embodiment of the present
invention.

FIGS. 13A-13D illustrate a Wide Translate instruction in
accordance with an exemplary embodiment of the present
invention.

FIGS. 14A-14E illustrate a Wide Multiply Matrix instruction
in accordance with an exemplary embodiment of the
present invention.

FIGS. 15A-15F illustrate a Wide Multiply Matrix Extract
instruction in accordance with an exemplary embodiment of
the present invention.

FIGS. 16A-16E illustrate a Wide Multiply Matrix Extract
Immediate instruction in accordance with an exemplary
embodiment of the present invention.

FIGS. 17A-17E illustrate a Wide Multiply Matrix Floating
point instruction in accordance with an exemplary
embodiment of the present invention.

FIGS. 18A-18D illustrate a Wide Multiply Matrix Galois
instruction in accordance with an exemplary embodiment of
the present invention.

FIGS. 19A-19G illustrate an Ensemble Extract Inplace
instruction in accordance with an exemplary embodiment of
the present invention.

FIGS. 20A-20J illustrate an Ensemble Extract instruction
in accordance with an exemplary embodiment of the present
invention.

FIGS. 21A-21F illustrate System and Privileged
Library Calls in accordance with an exemplary embodiment
of the present invention.

FIGS. 22A-22B illustrate an Ensemble Scale-Add
Floating-point instruction in accordance with an exemplary
embodiment of the present invention.

FIGS. 23A-23C illustrate a Group Boolean instruction in
accordance with an exemplary embodiment of the present
invention.

FIGS. 24A-24C illustrate a Branch Hint instruction in
accordance with an exemplary embodiment of the present
invention.

FIGS. 25A-25D illustrate an Ensemble Sink Floating- 
point instruction in accordance with an exemplary embodi- 
ment of the present invention. 

FIGS. 26A-26C illustrate Group Add instructions in 
accordance with an exemplary embodiment of the present 
invention. 

FIGS. 27A-27C illustrate Group Set instructions and 
Group Subtract instructions in accordance with an exem- 
plary embodiment of the present invention. 

FIGS. 28A-28C illustrate Ensemble Convolve, Ensemble
Divide, Ensemble Multiply, and Ensemble Multiply Sum
instructions in accordance with an exemplary embodiment
of the present invention.

FIG. 29 illustrates exemplary functions that are defined 
for use within the detailed instruction definitions in other 
sections. 

FIGS. 30A-30C illustrate Ensemble Floating-Point Add, 
Ensemble Floating-Point Divide, and Ensemble Floating- 
Point Multiply instructions in accordance with an exemplary 
embodiment of the present invention. 

FIGS. 31A-31C illustrate Ensemble Floating-Point Sub- 
tract instructions in accordance with an exemplary embodi- 
ment of the present invention. 

FIGS. 32A-32D illustrate Crossbar Compress, Expand, 
Rotate, and Shift instructions in accordance with an exem- 
plary embodiment of the present invention. 

FIGS. 33A-33D illustrate Extract instructions in accor- 
dance with an exemplary embodiment of the present inven- 
tion. 

FIGS. 34A-34E illustrate Shuffle instructions in accordance
with an exemplary embodiment of the present invention.

DETAILED DESCRIPTION OF THE
INVENTION

Processor Layout

Referring first to FIG. 1, a general purpose processor is
illustrated therein in block diagram form. In FIG. 1, four
copies of an access unit are shown, each with an access
instruction fetch queue A-Queue 101-104. Each access
instruction fetch queue A-Queue 101-104 is coupled to an
access register file AR 105-108, which are each coupled to
two access functional units A 109-116. In a typical
embodiment, each thread of the processor may have on the
order of sixty-four general purpose registers (e.g., the ARs
105-108 and ERs 125-128). The access units function
independently for four simultaneous threads of execution,
and each compute program control flow by performing
arithmetic and branch instructions and access memory by
performing load and store instructions. These access units
also provide wide operand specifiers for wide operand
instructions. These eight access functional units A 109-116
produce results for access register files AR 105-108 and
memory addresses to a shared memory system 117-120.

In one embodiment, the memory hierarchy includes
on-chip instruction and data memories, instruction and data
caches, a virtual memory facility, and interfaces to external
devices. In FIG. 1, the memory system is comprised of a
combined cache and niche memory 117, an external bus
interface 118, and, externally to the device, a secondary
cache 119 and main memory system with I/O devices 120.
The memory contents fetched from memory system
117-120 are combined with execute instructions not
performed by the access unit, and entered into the four execute
instruction queues E-Queue 121-124. For wide instructions,
memory contents fetched from memory system 117-120 are
also provided to wide operand microcaches 132-136 by bus
137. Instructions and memory data from E-Queue 121-124
are presented to execution register files 125-128, which
fetch execution register file source operands. The
instructions are coupled to the execution unit arbitration unit
Arbitration 131, that selects which instructions from the four
threads are to be routed to the available execution functional
units E 141 and 149, X 142 and 148, G 143-144 and
146-147, and T 145. The execution functional units E 141
and 149, the execution functional units X 142 and 148, and
the execution functional unit T 145 each contain a wide
operand microcache 132-136, which are each coupled to the
memory system 117 by bus 137.

The execution functional units G 143-144 and 146-147 
are group arithmetic and logical units that perform simple
arithmetic and logical instructions, including group
operations wherein the source and result operands represent a
group of values of a specified symbol size, which are
partitioned and operated on separately, with results cat- 
enated together. In a presently preferred embodiment the 
data path is 128 bits wide, although the present invention is 
not intended to be limited to any specific size of data path. 
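
The group operations described above can be sketched in a few
lines of Python. This is only an illustration under stated
assumptions: the function name group_add is hypothetical, the
operands are modeled as Python integers holding a 128-bit value,
and each symbol is assumed to wrap around on overflow (the text
above does not say how overflow within a symbol is handled).

```python
def group_add(a, b, size, width=128):
    """Add two `width`-bit operands as independent `size`-bit symbols.

    Each symbol is extracted, added separately, wrapped to `size` bits
    (an assumption), and the results are catenated back together.
    """
    mask = (1 << size) - 1
    result = 0
    for shift in range(0, width, size):
        symbol_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (symbol_sum & mask) << shift   # no carry into the next symbol
    return result
```

With 16-bit symbols a carry out of the low byte stays inside its
symbol; with 8-bit symbols the same inputs wrap within each byte,
which is the partitioning behavior the paragraph describes.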

The execution functional units X 142 and 148 are crossbar 
switch units that perform crossbar switch instructions. The 
crossbar switch units 142 and 148 perform data handling 
operations on the data stream provided over the data path 
source operand buses 151-158, including deals, shuffles, 
shifts, expands, compresses, swizzles, permutes and 
reverses, plus the wide operations discussed hereinafter. In 
a key element of a first aspect of the invention, at least one
such operation will be expanded to a width greater than the 
general register and data path width. 

The execution functional units E 141 and 149 are 
ensemble units that perform ensemble instructions using a 
large array multiplier, including group or vector multiply 
and matrix multiply of operands partitioned from data path 
source operand buses 151-158 and treated as integer, float- 
ing point, polynomial or Galois field values. Matrix multiply 
instructions and other operations utilize a wide operand 
loaded into the wide operand microcaches 132 and 136.

The execution functional unit T 145 is a translate unit that
performs table-look-up operations on a group of operands
partitioned from a register operand, and catenates the result.
The Wide Translate instruction utilizes a wide operand
loaded into the wide operand microcache 134.

The execution functional units E 141, 149, execution
functional units X 142, 148, and execution functional unit
T each contain dedicated storage to permit storage of source
operands including wide operands as discussed hereinafter.
The dedicated storage 132-136, which may be thought of as
a wide microcache, typically has a width which is a multiple
of the width of the data path operands related to the data path
source operand buses 151-158. Thus, if the width of the data
path 151-158 is 128 bits, the dedicated storage 132-136
may have a width of 256, 512, 1024 or 2048 bits. Operands
which utilize the full width of the dedicated storage are
referred to herein as wide operands, although it is not
necessary in all instances that a wide operand use the
entirety of the width of the dedicated storage; it is sufficient
that the wide operand use a portion greater than the width of
the memory data path of the output of the memory system
117-120 and the functional unit data path of the input of the
execution functional units 141-149, though not necessarily
greater than the width of the two combined. Because the
width of the dedicated storage 132-136 is greater than the
width of the memory operand bus 137, portions of wide
operands are loaded sequentially into the dedicated storage
132-136. However, once loaded, the wide operands may
then be used at substantially the same time. It can be seen
that functional units 141-149 and associated execution
registers 125-128 form a data functional unit, the exact
elements of which may vary with implementation.

The execution register file ER 125-128 source operands
are coupled to the execution units 141-145 using source
operand buses 151-154 and to the execution units 145-149
using source operand buses 155-158. The function unit
result operands from execution units 141-145 are coupled to
the execution register file ER 125-128 using result bus 161
and the function units result operands from execution units
145-149 are coupled to the execution register file using
result bus 162.

Wide Multiply Matrix 

The wide operands of the present invention provide the 
ability to execute complex instructions such as the wide 
multiply matrix instruction shown in FIG. 2, which can be 
appreciated in an alternative form, as well, from FIG. 3. As 
can be appreciated from FIGS. 2 and 3, a wide operand 
permits, for example, the matrix multiplication of various 
sizes and shapes which exceed the data path width. The 
example of FIG. 2 involves a matrix specified by register rc 
having 128*64/size bits (512 bits for this example) multi- 
plied by a vector contained in register rb having 128 bits, to 
yield a result, placed in register rd, of 128 bits. 
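
The operand sizes in this example can be checked with a short
Python sketch. Several details are assumptions made only for
illustration and are not taken from the figures: the element
size is 16 bits, elements are signed, the matrix element for
vector index i and result index j is stored at position
i * m + j, each result element is twice the element size, and
sums wrap to that width. The function names are hypothetical.

```python
def split_fields(value, total_bits, size):
    """Split an integer into total_bits // size unsigned fields, lowest first."""
    mask = (1 << size) - 1
    return [(value >> (i * size)) & mask for i in range(total_bits // size)]


def to_signed(x, bits):
    """Reinterpret an unsigned field as a two's complement value."""
    return x - (1 << bits) if x & (1 << (bits - 1)) else x


def wide_multiply_matrix(matrix, vector, size=16):
    """Multiply a 128*64/size-bit matrix by a 128-bit vector, giving 128 bits."""
    n = 128 // size                 # elements in the vector (rb)
    m = 64 // size                  # elements in the result (rd), 2*size bits each
    v = [to_signed(x, size) for x in split_fields(vector, 128, size)]
    a = [to_signed(x, size) for x in split_fields(matrix, 128 * 64 // size, size)]
    out_mask = (1 << (2 * size)) - 1
    result = 0
    for j in range(m):
        acc = sum(v[i] * a[i * m + j] for i in range(n))   # one summing node
        result |= (acc & out_mask) << (j * 2 * size)       # assumed wraparound
    return result
```

For size = 16 the matrix holds 8 x 4 elements (512 bits), the
vector holds 8 elements (128 bits), and the result holds 4
elements of 32 bits (128 bits), matching the widths stated above.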

The notation used in FIG. 2 and following similar figures 
illustrates a multiplication as a shaded area at the intersec- 
tion of two operands projected in the horizontal and vertical 
dimensions. A summing node is illustrated as a line segment
connecting darkened dots at the locations of multiplier
products that are summed. Products that are subtracted at the
summing node are indicated with a minus symbol within the 
shaded area. 

When the instruction operates on floating-point values,
the multiplications and summations illustrated are floating
point multiplications and summations. An exemplary
embodiment may perform these operations without round- 
ing the intermediate results, thus computing the final result 
as if computed to infinite precision and then rounded only 
once. 
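
The effect of rounding only once can be illustrated in Python.
This sketch is not the hardware method; it uses exact rational
arithmetic from the standard fractions module as a stand-in for
"infinite precision", and the function names are hypothetical.

```python
from fractions import Fraction


def dot_rounded_each_step(a, b):
    """Sum of products with a rounding after every multiply and add."""
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_rounded_once(a, b):
    """Sum of products computed exactly, then rounded a single time."""
    exact = sum(Fraction(x) * Fraction(y) for x, y in zip(a, b))
    return float(exact)   # the only rounding step
```

For a = [1e16, 1.0, -1e16] and b = [1.0, 1.0, 1.0], rounding at
every step loses the middle term and yields 0.0, while the
single-rounding version returns 1.0.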

It can be appreciated that an exemplary embodiment of 
the multipliers may compute the product in carry-save form
and may encode the multiplier rb using Booth encoding to
minimize circuit area and delay. It can be appreciated that an
exemplary embodiment of such summing nodes may perform
the summation of the products in any order, with
particular attention to minimizing computation delay, such
as by performing the additions in a binary or higher-radix
tree, and may use carry-save adders to perform the addition
to minimize the summation delay. It can also be appreciated 
that an exemplary embodiment may perform the summation 
using sufficient intermediate precision that no fixed-point or 
floating-point overflows occur on intermediate results. 
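
A carry-save adder reduces three values to two without
propagating carries; only the final addition needs a full
carry-propagate step. The following Python sketch shows the
idea on non-negative integers. The function names are
hypothetical and the sketch says nothing about how the
exemplary embodiment arranges its adder tree.

```python
def carry_save_add(a, b, c):
    """Reduce three addends to a (sum, carry) pair with no carry propagation."""
    partial_sum = a ^ b ^ c                        # per-bit sum
    carry = ((a & b) | (a & c) | (b & c)) << 1     # per-bit majority, shifted left
    return partial_sum, carry


def add_many(values):
    """Accumulate values in carry-save form, then do one ordinary addition."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)
    return s + c                                   # single carry-propagate add
```

Each call to carry_save_add costs about one full-adder delay
regardless of operand width, which is why three values can be
added with little more delay than two.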

A comparison of FIGS. 2 and 3 can be used to clarify the 
relation between the notation used in FIG. 2 and the more 
conventional schematic notation in FIG. 3, as the same 
operation is illustrated in these two figures. 

Wide Operand 

The operands that are substantially larger than the data
path width of the processor are provided by using a
general-purpose register to specify a memory specifier from which
more than one but in some embodiments several data path
widths of data can be read into the dedicated storage. The
memory specifier typically includes the memory address
together with the size and shape of the matrix of data being
operated on. The memory specifier or wide operand specifier
can be better appreciated from FIG. 5, in which a specifier
500 is seen to be an address, plus a field representative of the
size/2 and a further field representative of width/2, where
size is the product of the depth and width of the data. The
address is aligned to a specified size, for example sixty four
bytes, so that a plurality of low order bits (for example, six
bits) are zero. The specifier 500 can thus be seen to comprise
a first field 505 for the address, plus two field indicia 510
within the low order six bits to indicate size and width.

Specifier Decoding 

The decoding of the specifier 500 may be further appreciated
from FIG. 6, where a given specifier 600 is made up
of an address field 605 together with a field 610 comprising
a plurality of low order bits. By a series of arithmetic
operations shown at steps 615 and 620, the portion of the field 610
representative of width/2 is developed. In a similar series of
steps shown at 625 and 630, the value of t is decoded, which
can then be used to decode both size and address. The
portion of the field 610 representative of size/2 is decoded as
shown at steps 635 and 640, while the address is decoded in 
a similar way at steps 645 and 650. 
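
One way such a specifier could be packed and unpacked is
sketched below in Python. This is a hypothetical reading, not a
transcription of FIG. 6: it assumes size and width are powers of
two with width smaller than size, that the address is aligned to
size, and that the fields are separated by isolating the lowest
set bit twice. The function names and the intermediate name t
are used only for illustration.

```python
def encode_specifier(address, size, width):
    """Pack address + size/2 + width/2 into one value (assumed constraints)."""
    assert size & (size - 1) == 0 and width & (width - 1) == 0, "powers of two"
    assert 2 <= width < size, "sketch assumes width is smaller than size"
    assert address % size == 0, "address must be aligned to size"
    return address + size // 2 + width // 2


def decode_specifier(spec):
    """Recover (address, size, width) from a packed specifier."""
    half_width = spec & -spec     # lowest set bit is width/2
    t = spec & (spec - 1)         # clear it, leaving address + size/2
    half_size = t & -t            # next lowest set bit is size/2
    address = t & (t - 1)         # clear it, leaving the aligned address
    return address, 2 * half_size, 2 * half_width
```

Because the address is aligned to size, its lowest possible set
bit lies above the size/2 bit, so the two low-order fields can
always be separated from the address under these assumptions.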

Wide Function Unit

The wide function unit may be better appreciated from
FIG. 7, in which a register number 700 is provided to an
operand checker 705. Wide operand specifier 710
communicates with the operand checker 705 and also addresses
memory 715 having a defined memory width. The memory
address includes a plurality of register operands 720A-n,
which are accumulated in a dedicated storage portion 714 of
a data functional unit 725. In the exemplary embodiment
shown in FIG. 7, the dedicated storage 714 can be seen to
have a width equal to eight data path widths, such that eight
wide operand portions 730A-H are sequentially loaded into
the dedicated storage to form the wide operand. Although
eight portions are shown in FIG. 7, the present invention is
not limited to eight or any other specific multiple of data
path widths. Once the wide operand portions 730A-H are
sequentially loaded, they may be used as a single, wide
operand 735 by the functional element 740, which may be
any element(s) from FIG. 1 connected thereto. The result of
the wide operand is then provided to a result register 745,
which in a presently preferred embodiment is of the same
width as the memory width.

Once the wide operand is successfully loaded into the
dedicated storage 714, a second aspect of the present
invention may be appreciated. Further execution of this
instruction or other similar instructions that specify the same
memory address can read the dedicated storage to obtain the
operand value under specific conditions that determine
whether the memory operand has been altered by intervening
instructions. Assuming that these conditions are met, the
memory operand fetch from the dedicated storage is
combined with one or more register operands in the functional
unit, producing a result. In some embodiments, the size of
the result is limited to that of a general register, so that no
similar dedicated storage is required for the result. However,
in some different embodiments, the result may be a wide
operand, to further enhance performance.

To permit the wide operand value to be addressed by
subsequent instructions specifying the same memory
address, various conditions must be checked and confirmed:

Those conditions include: 

1. Each memory store instruction checks the memory 
address against the memory addresses recorded for the 
dedicated storage. Any match causes the storage to be 
marked invalid, since a memory store instruction 
directed to any of the memory addresses stored in 
dedicated storage 714 means that data has been over- 
written. 

2. The register number used to address the storage is 
recorded. If no intervening instructions have written to 
the register, and the same register is used on the 
subsequent instruction, the storage is valid (unless 
marked invalid by rule #1). 

3. If the register has been modified or a different register 
number is used, the value of the register is read and 
compared against the address recorded for the dedi- 
cated storage. This uses more resources than #1 
because of the need to fetch the register contents and 
because the width of the register is greater than that of 
the register number itself. If the address matches, the
storage is valid. The new register number is recorded 
for the dedicated storage. 

4. If conditions #2 or #3 are not met, the register contents
are used to address the general-purpose processor's memory
and load the dedicated storage. If dedicated storage is
already fully loaded, a portion of the dedicated storage must
be discarded (victimized) to make room for the new value.
The instruction is then performed using the newly updated
dedicated storage. The address and register number are
recorded for the dedicated storage.

By checking the above conditions, the need for saving and
restoring the dedicated storage is eliminated. In addition, if
the context of the processor is changed and the new context
does not employ Wide instructions that reference the same
dedicated storage, when the original context is restored, the
contents of the dedicated storage are allowed to be used
without refreshing the value from memory, using checking
rule #3. Because the values in the dedicated storage are read
from memory and not modified directly by performing wide
operations, the values can be discarded at any time without
saving the results into general memory. This property
simplifies the implementation of rule #4 above.

An alternate embodiment of the present invention can
replace rule #1 above with the following rule:

1a. Each memory store instruction checks the memory
address against the memory addresses recorded for the
dedicated storage. Any match causes the dedicated
storage to be updated, as well as the general memory.

By use of the above rule 1a, memory store instructions
can modify the dedicated storage, updating just the piece of
the dedicated storage that has been changed, leaving the
remainder intact. By continuing to update the general
memory, it is still true that the contents of the dedicated
memory can be discarded at any time without saving the
results into general memory. Thus rule #4 is not made more
complicated by this choice. The advantage of this alternate
embodiment is that the dedicated storage need not be 
discarded (invalidated) by memory store operations. 
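
The difference between rule #1 and the alternate rule for
stores can be sketched as a small Python model. The class and
function names are hypothetical, memory is modeled as a
bytearray, and stores are assumed to write one byte at a time;
none of these details come from the text above.

```python
class DedicatedStorage:
    """Toy model of one wide-operand copy held next to a functional unit."""

    def __init__(self, base, data, update_on_store=False):
        self.base = base                         # address the operand was loaded from
        self.data = bytearray(data)              # private copy of the wide operand
        self.valid = True                        # cleared when the copy is stale
        self.update_on_store = update_on_store   # False: rule 1, True: alternate rule

    def covers(self, addr):
        return self.base <= addr < self.base + len(self.data)

    def snoop_store(self, addr, value):
        """Check one store address against the recorded operand addresses."""
        if not (self.valid and self.covers(addr)):
            return
        if self.update_on_store:
            self.data[addr - self.base] = value  # patch only the changed byte
        else:
            self.valid = False                   # mark the whole copy invalid


def store_byte(memory, storages, addr, value):
    """Perform a store; general memory is written under both rules."""
    memory[addr] = value
    for storage in storages:
        storage.snoop_store(addr, value)
```

Under either rule general memory always holds the current data,
so a copy can be dropped at any time without a write-back; the
alternate rule simply avoids a reload after a matching store.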

Wide Microcache Data Structures 

Referring next to FIG. 9, an exemplary arrangement of the
data structures of the wide microcache or dedicated storage
114 may be better appreciated. The wide microcache
contents, wmc.c, can be seen to form a plurality of data path
widths 900A-n, although in the example shown the number
is eight. The physical address, wmc.pa, is shown as 64 bits
in the example shown, although the invention is not limited
to a specific width. The size of the contents, wmc.size, is also
provided in a field which is shown as 10 bits in an exemplary
embodiment. A "contents valid" flag, wmc.cv, of one bit is
also included in the data structure, together with a two bit
field for thread last used, or wmc.th. In addition, a six bit
field for register last used, wmc.reg, is provided in an
exemplary embodiment. Further, a one bit flag for register
and thread valid, or wmc.rtv, may be provided. 

Wide Microcache Control — Software 

The process by which the microcache is initially written
with a wide operand, and thereafter verified as valid for fast
subsequent operations, may be better appreciated from FIG.
8. The process begins at 800, and progresses to step 805
where a check of the register contents is made against the
stored value wmc.rc. If true, a check is made at step 810 to
verify the thread. If true, the process then advances to step
815 to verify whether the register and thread are valid. If step
815 reports as true, a check is made at step 820 to verify
whether the contents are valid. If all of steps 805 through
820 return as true, the subsequent instruction is able to
utilize the existing wide operand as shown at step 825, after
which the process ends. However, if any of steps 805
through 820 return as false, the process branches to step 830,
where content, physical address and size are set. Because
steps 805 through 820 all lead to either step 825 or 830, steps
805 through 820 may be performed in any order or
simultaneously without altering the process. The process then
advances to step 835 where size is checked. This check
basically ensures that the size of the translation unit is
greater than or equal to the size of the wide operand, so that
a physical address can directly replace the use of a virtual
address. The concern is that, in some embodiments, the wide
operands may be larger than the minimum region that the
virtual memory system is capable of mapping. As a result, it
would be possible for a single contiguous virtual address
range to be mapped into multiple, disjoint physical address
ranges, complicating the task of comparing physical
addresses. By determining the size of the wide operand and
comparing that size against the size of the virtual address
mapping region which is referenced, the instruction is
aborted with an exception trap if the wide operand is larger
than the mapping region. This ensures secure operation of
the processor. Software can then re-map the region using a
larger size map to continue execution if desired. Thus, if size
is reported as unacceptable at step 835, an exception is
generated at step 840. If size is acceptable, the process
advances to step 845 where physical address is checked. If
the check reports as met, the process advances to step 850,
where a check of the contents valid flag is made. If either
check at step 845 or 850 reports as false, the process
branches and new content is written into the dedicated
storage 114, with the fields thereof being set accordingly.
Whether the check at step 850 reported true, or whether new
content was written at step 855, the process advances to step
860 where appropriate fields are set to indicate the validity
of the data, after which the requested function can be
performed at step 825. The process then ends. 
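
The control flow of steps 835 through 860 can be sketched in software as follows. This is only an illustrative sketch and not the described hardware: the `Microcache` class, the `check_and_fill` function, the `load` callback and the `WideOperandTooLarge` exception are hypothetical names introduced here, and the dedicated storage 114 is modeled as a simple record.

```python
from dataclasses import dataclass


class WideOperandTooLarge(Exception):
    """Hypothetical stand-in for the exception trap generated at step 840."""


@dataclass
class Microcache:
    # Hypothetical model of the fields held with the dedicated storage.
    physical_address: int = -1
    contents_valid: bool = False
    contents: bytes = b""


def check_and_fill(cache, phys_addr, operand_size, mapping_size, load):
    """Return True if the cached contents were reused, False if reloaded."""
    # Steps 835/840: the wide operand must fit inside one mapping region.
    if operand_size > mapping_size:
        raise WideOperandTooLarge("wide operand larger than mapping region")
    # Steps 845/850: physical address match and contents-valid flag.
    hit = cache.physical_address == phys_addr and cache.contents_valid
    if not hit:
        # Step 855: write new content into the dedicated storage.
        cache.contents = load(phys_addr, operand_size)
        cache.physical_address = phys_addr
    # Step 860: mark the data as valid before the function is performed.
    cache.contents_valid = True
    return hit
```

In this sketch a second reference to the same physical address reuses the stored contents, which corresponds to the case where both checks report true.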

Wide Microcache Control — Hardware 

Referring next to FIGS. 10 and 11, which together show 
the operation of the microcache controller from a hardware 
standpoint, the operation of the microcache controller may 
be better understood. In the hardware implementation, it is 
clear that conditions which are indicated as sequential steps 
in FIGS. 8 and 9 above can be performed in parallel, reducing 
the delay for such wide operand checking. Further, a copy of 
the indicated hardware may be included for each wide 
microcache, and thereby all such microcaches as may be 
alternatively referenced by an instruction can be tested in 
parallel. It is believed that no further discussion of FIGS. 10 
and 11 is required in view of the extensive discussion of 
FIGS. 8 and 9, above. 

Various alternatives to the foregoing approach do exist for 
the use of wide operands, including an implementation in 
which a single instruction can accept two wide operands, 
partition the operands into symbols, multiply corresponding 
symbols together, and add the products to produce a single 
scalar value or a vector of partitioned values of width of the 
register file, possibly after extraction of a portion of the 
sums. Such an instruction can be valuable for detection of 
motion or estimation of motion in video compression. A 
further enhancement of such an instruction can incremen- 
tally update the dedicated storage if the address of one wide 
operand is within the range of previously specified wide 
operands in the dedicated storage, by loading only the 
portion not already within the range and shifting the in-range 
portion as required. Such an enhancement allows the opera- 
tion to be performed over a "sliding window" of possible 
values. In such an instruction, one wide operand is aligned 
and supplies the size and shape information, while the 
second wide operand, updated incrementally, is not aligned. 

Another alternative embodiment of the present invention 
can define additional instructions where the result operand is 
a wide operand. Such an enhancement removes the limit that 
a result can be no larger than the size of a general register, 
further enhancing performance. These wide results can be 
cached locally to the functional unit that created them, but 
must be copied to the general memory system before the 
storage can be reused and before the virtual memory system 
alters the mapping of the address of the wide result. Data 
paths must be added so that load operations and other wide 
operations can read these wide results — forwarding of a 
wide result from the output of a functional unit back to its 
input is relatively easy, but additional data paths may have 
to be introduced if it is desired to forward wide results back 
to other functional units as wide operands. 

As previously discussed, a specification of the size and 
shape of the memory operand is included in the low-order 
bits of the address. In a presently preferred implementation, 
such memory operands are typically a power of two in size 
and aligned to that size. Generally, one half the total size is 
added (or inclusively or'ed, or exclusively or'ed) to the 
memory address, and one half of the data width is added (or 
inclusively or'ed, or exclusively or'ed) to the memory 
address. These bits can be decoded and stripped from the 
memory address, so that the controller is made to step 
through all the required addresses. This decreases the num- 
ber of distinct operands required for these instructions, as the 
size, shape and address of the memory operand are com- 
bined into a single register operand value. 
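
The combined size, shape and address encoding can be illustrated with a short sketch. It assumes, for illustration only, that both specifiers are present, that size and width are powers of two measured in bytes with the width strictly smaller than the size, and that the base address is aligned to the size; the function names are hypothetical and do not come from the specification.

```python
def encode_operand_address(base, size, width):
    """Fold size and width (bytes, powers of two) into an aligned base."""
    for name, value in (("size", size), ("width", width)):
        if value < 2 or value & (value - 1):
            raise ValueError(f"{name} must be a power of two >= 2")
    if width >= size:
        raise ValueError("this sketch assumes width < size")
    if base % size:
        raise ValueError("base must be aligned to size")
    # One half of the total size and one half of the width are added.
    return base + size // 2 + width // 2


def decode_operand_address(address):
    """Recover (base, size, width) from the two lowest set bits."""
    width_half = address & -address      # lowest set bit: width / 2
    rest = address - width_half
    size_half = rest & -rest             # next set bit: size / 2
    if width_half == 0 or size_half == 0:
        raise ValueError("address carries no size/shape specifier")
    return rest - size_half, 2 * size_half, 2 * width_half
```

Because the base is aligned to the operand size, the two specifier bits are the lowest set bits of the combined value, so they can be decoded and stripped without any additional operand.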

The following table illustrates the arithmetic and descrip- 
tive notation used in the pseudocode in the Figures refer- 
enced hereinafter: 

x + y     two's complement addition of x and y. Result is the same size as 
          the operands, and operands must be of equal size. 
x - y     two's complement subtraction of y from x. Result is the same 
          size as the operands, and operands must be of equal size. 
x * y     two's complement multiplication of x and y. Result is the same 
          size as the operands, and operands must be of equal size. 
x / y     two's complement division of x by y. Result is the same size 
          as the operands, and operands must be of equal size. 
x & y     bitwise and of x and y. Result is same size as the operands, 
          and operands must be of equal size. 
x | y     bitwise or of x and y. Result is same size as the operands, 
          and operands must be of equal size. 
x ^ y     bitwise exclusive-or of x and y. Result is same size as 
          the operands, and operands must be of equal size. 
~x        bitwise inversion of x. Result is same size as the operand. 
x = y     two's complement equality comparison between x and y. 
          Result is a single bit, and operands must be of equal size. 
x ≠ y     two's complement inequality comparison between x and y. 
          Result is a single bit, and operands must be of equal size. 
x < y     two's complement less than comparison between x and y. 
          Result is a single bit, and operands must be of equal size. 
x ≥ y     two's complement greater than or equal comparison between x 
          and y. Result is a single bit, and operands must 
          be of equal size. 
√x        floating-point square root of x 
x ∥ y     concatenation of bit field x to left of bit field y 
x^y       (y as superscript) binary digit x repeated, concatenated y 
          times. Size of result is y. 
x_y       (y as subscript) extraction of bit y (using little-endian bit 
          numbering) from value x. Result is a single bit. 
x_y..z    (y..z as subscript) extraction of bit field formed from bits y 
          through z of value x. Size of result is y-z+1; if z > y, 
          result is an empty string. 
x?y:z     value of y, if x is true, otherwise value of z. Value of x is 
          a single bit. 
x ← y     bitwise assignment of x to value of y 
x.y       subfield of structured bitfield x 
Sn        signed, two's complement, binary data format of n bytes 
Un        unsigned binary data format of n bytes 
Fn        floating-point data format of n bytes 

Wide Operations 

Particular examples of wide operations which are defined 
by the present invention include the Wide Switch instruction 
that performs bit-level switching; the Wide Translate 
instruction which performs byte (or larger) table lookup; 
Wide Multiply Matrix; Wide Multiply Matrix Extract and 
Wide Multiply Matrix Extract Immediate (discussed below), 
Wide Multiply Matrix Floating-point, and Wide Multiply 
Matrix Galois (also discussed below). While the discussion 
below focuses on particular sizes for the exemplary 
instructions, it will be appreciated that the invention is not 
limited to a particular width. 

Wide Switch 

An exemplary embodiment of the Wide Switch instruc- 
tion is shown in FIGS. 12A-12D. In an exemplary 
embodiment, the Wide Switch instruction rearranges the 
contents of up to two registers (256 bits) at the bit level, 
producing a full-width (128 bits) register result. To control 
the rearrangement, a wide operand specified by a single 
register, consisting of eight bits per bit position is used. For 
each result bit position, eight wide operand bits for each bit 
position select which of the 256 possible source register bits 
to place in the result. When a wide operand size smaller than 
128 bytes is specified, the high order bits of the memory 
operand are replaced with values corresponding to the result 
bit position, so that the memory operand specifies a bit 
selection within symbols of the operand size, performing the 
same operation on each symbol. 

In an exemplary embodiment, these instructions take an 
address from a general register to fetch a large operand from 
memory, a second operand from a general register, perform 
a group of operations on partitions of bits in the operands, 
and catenate the results together, placing the result in a 
general register. An exemplary embodiment of the format 
1210 of the Wide Switch instruction is shown in FIG. 12A. 

An exemplary embodiment of a schematic 1230 of the 
Wide Switch instruction is shown in FIG. 12B. In an 
exemplary embodiment, the contents of register rc specifies 
a virtual address and optionally an operand size, and a value 
of specified size is loaded from memory. A second value is 
the catenated contents of registers rd and rb. Eight corre- 
sponding bits from the memory value are used to select a 
single result bit from the second value, for each correspond- 
ing bit position. The group of results is catenated and placed 
in register ra. 
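
The bit selection just described can be sketched as follows for the full 128-byte operand. This is a minimal sketch under stated assumptions: registers and the memory value are modeled as Python integers, the control byte for result bit i is assumed to occupy bits 8i+7 through 8i of the memory value (little-endian ordering), and rd is taken as the high half of the catenated source. The function name `wide_switch` is hypothetical.

```python
def wide_switch(mem, rd, rb):
    """Select each of 128 result bits from the 256-bit value rd || rb.

    mem: 1024-bit integer holding one 8-bit selector per result bit.
    rd, rb: 128-bit integers; rd is assumed to form the high half.
    """
    source = (rd << 128) | rb
    result = 0
    for i in range(128):
        selector = (mem >> (8 * i)) & 0xFF   # which source bit to take
        bit = (source >> selector) & 1
        result |= bit << i
    return result
```

For example, a memory operand whose i-th selector equals i copies rb unchanged into the result, while selectors of 127 minus i reverse the bit order of rb.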

In an exemplary embodiment, the virtual address must 
either be aligned to 128 bytes, or must be the sum of an 
aligned address and one-half of the size of the memory 
operand in bytes. An aligned address must be an exact 
multiple of the size expressed in bytes. The size of the 
memory operand must be 8, 16, 32, 64, or 128 bytes. If the 
address is not valid an "access disallowed by virtual 
address" exception occurs. When a size smaller than 128 bits 
is specified, the high order bits of the memory operand are 
replaced with values corresponding to the bit position, so 
that the same memory operand specifies a bit selection 
within symbols of the operand size, and the same operation 
is performed on each symbol. 

In an exemplary embodiment, a wide switch 
(W.SWITCH.L or W.SWITCH.B) instruction specifies an 
8-bit location for each result bit from the memory operand, 
that selects one of the 256 bits represented by the catenated 
contents of registers rd and rb. 

An exemplary embodiment of the pseudocode 1250 of the 
Wide Switch instruction is shown in FIG. 12C. An exem- 
plary embodiment of the exceptions 1280 of the Wide 
Switch instruction is shown in FIG. 12D. 

Wide Translate 

An exemplary embodiment of the Wide Translate instruc- 
tion is shown in FIGS. 13A-13D. In an exemplary 
embodiment, the Wide Translate instructions use a wide 
operand to specify a table of depth up to 256 entries and 
width of up to 128 bits. The contents of a register is 
partitioned into operands of one, two, four, or eight bytes, 
and the partitions are used to select values from the table in 
parallel. The depth and width of the table can be selected by 
specifying the size and shape of the wide operand as 
described above. 

In an exemplary embodiment, these instructions take an 
address from a general register to fetch a large operand from 
memory, a second operand from a general register, perform 
a group of operations on partitions of bits in the operands, 
and catenate the results together, placing the result in a 
general register. An exemplary embodiment of the format 
1310 of the Wide Translate instruction is shown in FIG. 13A. 

An exemplary embodiment of the schematic 1330 of the 
Wide Translate instruction is shown in FIG. 13B. In an 
exemplary embodiment, the contents of register rc is used as 
a virtual address, and a value of specified size is loaded from 
memory. A second value is the contents of register rb. The 
values are partitioned into groups of operands of a size 
specified. The low-order bytes of the second group of values 
are used as addresses to choose entries from one or more 
tables constructed from the first value, producing a group of 
values. The group of results is catenated and placed in 
register rd. 

In an exemplary embodiment, by default, the total width 
of tables is 128 bits, and a total table width of 128, 64, 32, 
16 or 8 bits, but not less than the group size may be specified 
by adding the desired total table width in bytes to the 
specified address: 16, 8, 4, 2, or 1. When fewer than 128 bits 
are specified, the tables repeat to fill the 128 bit width. 

In an exemplary embodiment, the default depth of each 
table is 256 entries, or in bytes is 32 times the group size in 
bits. An operation may specify 4, 8, 16, 32, 64, 128 or 256 
entry tables, by adding one half of the memory operand size 
to the address. Table index values are masked to ensure that 
only the specified portion of the table is used. Tables with 
just 2 entries cannot be specified; if 2-entry tables are 
desired, it is recommended to load the entries into registers 
and use G.MUX to select the table entries. 
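
A byte-granularity lookup with index masking can be sketched as follows. The sketch assumes a group size of 8 bits and a 128-bit total table width, so that each of the 16 byte lanes has its own table column; the table is modeled as a list of 128-bit integers, one per entry, and the name `wide_translate_bytes` is hypothetical rather than taken from the specification.

```python
def wide_translate_bytes(table, rb, depth=256):
    """Translate each byte of the 128-bit value rb through a table.

    table: list of at least `depth` 128-bit integers; entry k holds the
           value for index k in every byte lane.
    depth: number of entries actually used (a power of two, 4..256).
    """
    if depth < 4 or depth > 256 or depth & (depth - 1):
        raise ValueError("depth must be a power of two from 4 to 256")
    result = 0
    for lane in range(16):
        index = (rb >> (8 * lane)) & 0xFF
        index &= depth - 1                 # mask to the specified portion
        entry = (table[index] >> (8 * lane)) & 0xFF
        result |= entry << (8 * lane)
    return result
```

Masking the index in this way means a shallow table never reads entries beyond its specified depth, whatever value the register byte holds.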

In an exemplary embodiment, failing to initialize the 
entire table is a potential security hole, as an instruction 
with a small-depth table could access table entries previ- 
ously initialized by an instruction with a large-depth table. 
This security hole may be closed either by initializing the 
entire table, even if extra cycles are required, or by masking 
the index bits so that only the initialized portion of the table 
is used. An exemplary embodiment may initialize the entire 
table with no penalty in cycles by writing to as many as 128 
table entries at once. Initializing the entire table with writes 
to only one entry at a time requires writing 256 cycles, even 
when the table is smaller. Masking the index bits is the 
preferred solution. 

In an exemplary embodiment, masking the index bits 
suggests that this instruction, for tables larger than 256 
entries, may be extended to a general-purpose memory 
translate function where the processor performs enough 
independent load operations to fill the 128 bits. Thus, the 16, 
32, and 64 bit versions of this function perform the equivalent 
of 8, 4, or 2 withdraw, 8, 4, or 2 load-indexed and 7, 3, or 1 
group-extract instructions. In other words, this instruction 
can be as powerful as 23, 11, or 5 previously existing 
instructions. The 8-bit version is a single cycle operation 
replacing 47 existing instructions, so these extensions are 
not as powerful, but nonetheless, this is at least a 50% 
improvement on a 2-issue processor, even with one cycle per 
load timing. To make this possible, the default table size 
would become 65536, 2^32 and 2^64 for 16, 32 and 64-bit 
versions of the instruction. 

In an exemplary embodiment, for the big-endian version 
of this instruction, in the definition below, the contents of 
register rb is complemented. This reflects a desire to orga- 
nize the table so that the lowest addressed table entries are 
selected when the index is zero. In the logical 
implementation, complementing the index can be avoided 
by loading the table memory differently for big-endian and 
little-endian versions; specifically by loading the table into 
memory so that the highest-addressed table entries are 
selected when the index is zero for a big-endian version of 
the instruction. In an exemplary embodiment of the logical 
implementation, complementing the index can be avoided 
by loading the table memory differently for big-endian and 
little-endian versions. In order to avoid complementing the 
index, the table memory is loaded differently for big-endian 
versions of the instruction by complementing the addresses 
at which table entries are written into the table for a 
big-endian version of the instruction. 

In an exemplary embodiment, the virtual address must 
either be aligned to 4096 bytes, or must be the sum of an 
aligned address and one-half of the size of the memory 
operand in bytes and/or the desired total table width in bytes. 
An aligned address must be an exact multiple of the size 
expressed in bytes. The size of the memory operand must be 
a power of two from 4 to 4096 bytes, but must be at least 4 
times the group size and 4 times the total table width. If the 
address is not valid an "access disallowed by virtual 
address" exception occurs. 

In an exemplary embodiment, a wide translate 
(W.TRANSLATE.8.L or W.TRANSLATE.8.B) instruction 
specifies a translation table of 16 entries (vsize=16) in depth, 
a group size of 1 byte (gsize=8 bits), and a width of 8 bytes 
(wsize=64 bits). The address specifies a total table size 
(msize=1024 bits=vsize*wsize) and a table width (wsize=64 
bits) by adding one half of the size in bytes of the table (64) 
and adding the size in bytes of the table width (8) to the table 
address in the address specification. The instruction will 
create duplicates of this table in the upper and lower 64 bits 
of the data path, so that 128 bits of operand are processed at 
once, yielding a 128 bit result. 

An exemplary embodiment of the pseudocode 1350 of the 
Wide Translate instruction is shown in FIG. 13C. An exem- 
plary embodiment of the exceptions 1380 of the Wide 
Translate instruction is shown in FIG. 13D. 

Wide Multiply Matrix 

An exemplary embodiment of the Wide Multiply Matrix 
instruction is shown in FIGS. 14A-14E. In an exemplary 
embodiment, the Wide Multiply Matrix instructions use a 
wide operand to specify a matrix of values of width up to 64 
bits (one half of register file and data path width) and depth 
of up to 128 bits/symbol size. The contents of a general 
register (128 bits) is used as a source operand, partitioned 
into a vector of symbols, and multiplied with the matrix, 
producing a vector of width up to 128 bits of symbols of 
twice the size of the source operand symbols. The width and 
depth of the matrix can be selected by specifying the size 
and shape of the wide operand as described above. Controls 
within the instruction allow specification of signed, mixed 
signed, unsigned, complex, or polynomial operands. 

In an exemplary embodiment, these instructions take an 
address from a general register to fetch a large operand from 
memory, a second operand from a general register, perform 
a group of operations on partitions of bits in the operands, 
and catenate the results together, placing the result in a 
general register. An exemplary embodiment of the format 
1410 of the Wide Multiply Matrix instruction is shown in 
FIG. 14A. 

An exemplary embodiment of the schematics 1430 and 
1460 of the Wide Multiply Matrix instruction is shown in 
FIGS. 14B and 14C. In an exemplary embodiment, the 
contents of register rc is used as a virtual address, and a 
value of specified size is loaded from memory. A second 
value is the contents of register rb. The values are partitioned 
into groups of operands of the size specified. The second 
values are multiplied with the first values, then summed, 
producing a group of result values. The group of result 
values is catenated and placed in register rd. 

In an exemplary embodiment, the memory multiply 
instructions (W.MUL.MAT, W.MUL.MAT.C, 
W.MUL.MAT.M, W.MUL.MAT.P, W.MUL.MAT.U) per- 
form a partitioned array multiply of up to 8192 bits, that is 
64x128 bits. The width of the array can be limited to 64, 32, 
or 16 bits, but not smaller than twice the group size, by 
adding one half the desired size in bytes to the virtual 
address operand: 4, 2, or 1. The array can be limited 
vertically to 128, 64, 32, or 16 bits, but not smaller than 
twice the group size, by adding one-half the desired memory 
operand size in bytes to the virtual address operand. 

In an exemplary embodiment, the virtual address must 
either be aligned to 1024/gsize bytes (or 512/gsize for 
W.MUL.MAT.C) (with gsize measured in bits), or must be 
the sum of an aligned address and one half of the size of the 
memory operand in bytes and/or one quarter of the size of 
the result in bytes. An aligned address must be an exact 
multiple of the size expressed in bytes. If the address is not 
valid an "access disallowed by virtual address" exception 
occurs. 

In an exemplary embodiment, a wide multiply octlets 
instruction (W.MUL.MAT.type.64, type=NONE M U P) is 
not implemented, and causes a reserved instruction 
exception, as an ensemble-multiply-sum-octlets instruction 
(E.MUL.SUM.type.64) performs the same operation except 
that the multiplier is sourced from a 128-bit register rather 
than memory. Similarly, instead of wide-multiply-complex- 
quadlets instruction (W.MUL.MAT.C.32), one should use an 
ensemble-multiply-complex-quadlets instruction 
(E.MUL.SUM.C.32). 

As shown in FIG. 14B, an exemplary embodiment of a 
wide-multiply-doublets instruction (W.MUL.MAT, 
W.MUL.MAT.M, W.MUL.MAT.P, W.MUL.MAT.U) multi- 
plies memory [m31 m30 . . . m1 m0] with vector [h g f e d 
c b a], yielding products [hm31+gm27+ . . . +bm7+am3 . . . 
hm28+gm24+ . . . +bm4+am0]. 

As shown in FIG. 14C, an exemplary embodiment of a 
wide-multiply-matrix-complex-doublets instruction 
(W.MUL.MAT.C) multiplies memory [m15 m14 . . . m1 m0] 
with vector [h g f e d c b a], yielding products 
[hm14+gm15+ . . . +bm2+am3 . . . hm12+gm13+ . . . 
+bm0+am1 -hm13+gm12+ . . . -bm1+am0]. 

An exemplary embodiment of the pseudocode 1480 of the 
Wide Multiply Matrix instruction is shown in FIG. 14D. An 
exemplary embodiment of the exceptions 1490 of the Wide 
Multiply Matrix instruction is shown in FIG. 14E. 

Wide Multiply Matrix Extract 

An exemplary embodiment of the Wide Multiply Matrix 
Extract instruction is shown in FIGS. 15A-15F. In an 
exemplary embodiment, the Wide Multiply Matrix Extract 
instructions use a wide operand to specify a matrix of value 
of width up to 128 bits (full width of register file and data 
path) and depth of up to 128 bits/symbol size. The contents 
of a general register (128 bits) is used as a source operand, 
partitioned into a vector of symbols, and multiplied with the 
matrix, producing a vector of width up to 256 bits of 
symbols of twice the size of the source operand symbols plus 
additional bits to represent the sums of products without 
overflow. The results are then extracted in a manner 
described below (Enhanced Multiply Bandwidth by Result 
Extraction), as controlled by the contents of a general 
register specified by the instruction. The general register also 
specifies the format of the operands: signed, mixed-signed, 
unsigned, and complex as well as the size of the operands, 
byte (8 bit), doublet (16 bit), quadlet (32 bit), or hexlet (64 
bit). 

In an exemplary embodiment, these instructions take an 
address from a general register to fetch a large operand from 
memory, a second operand from a general register, perform 
a group of operations on partitions of bits in the operands, 
and catenate the results together, placing the result in a 
general register. An exemplary embodiment of the format 
1510 of the Wide Multiply Matrix Extract instruction is 
shown in FIG. 15A. 

An exemplary embodiment of the schematics 1530 and 
1560 of the Wide Multiply Matrix Extract instruction is 
shown in FIGS. 15C and 15D. In an exemplary embodiment, 
the contents of register rc is used as a virtual address, and a 
value of specified size is loaded from memory. A second 
value is the contents of register rd. The group size and other 
parameters are specified from the contents of register rb. The 
values are partitioned into groups of operands of the size 
specified and are multiplied and summed, producing a group 
of values. The group of values is rounded, and limited as 
specified, yielding a group of results which is the size 
specified. The group of results is catenated and placed in 
register ra. 

In an exemplary embodiment, the size of this operation is 
determined from the contents of register rb. The multiplier 
usage is constant, but the memory operand size is inversely 
related to the group size. Presumably this can be checked for 
cache validity. 

In an exemplary embodiment, low order bits of rc are used 
to designate a size, which must be consistent with the group 
size. Because the memory operand is cached, the size can 
also be cached, thus eliminating the time required to decode 
the size, whether from rb or from rc. 

In an exemplary embodiment, the wide multiply matrix 
extract instructions (W.MUL.MAT.X.B, W.MUL.MAT.X.L) 
perform a partitioned array multiply of up to 16384 bits, that 
is 128x128 bits. The width of the array can be limited to 128, 
64, 32, or 16 bits, but not smaller than twice the group size, 
by adding one half the desired size in bytes to the virtual 
address operand: 8, 4, 2, or 1. The array can be limited 
vertically to 128, 64, 32, or 16 bits, but not smaller than 
twice the group size, by adding one half the desired memory 
operand size in bytes to the virtual address operand. 

As shown in FIG. 15B, in an exemplary embodiment, bits 
31 . . . 0 of the contents of register rb specifies several 
parameters which control the manner in which data is 
extracted. The position and default values of the control 
fields allow for the source position to be added to a fixed 
control value for dynamic computation, and allow for the 
lower 16 bits of the control field to be set for some of the 
simpler extract cases by a single GCOPYI instruction. 

In an exemplary embodiment, the table below describes 
the meaning of each label: 

In an exemplary embodiment, the 9 bit gssp field encodes 
both the group size, gsize, and source position, spos, accord- 
ing to the formula gssp=512-4*gsize+spos. The group size, 
gsize, is a power of two in the range 1 . . . 128. The source 
position, spos, is in the range 0 . . . (2*gsize)-1. 

In an exemplary embodiment, the values in the s, n, m, l 
and rnd fields have the following meaning: 

In an exemplary embodiment, the virtual address must be 
aligned, that is, it must be an exact multiple of the operand 
size expressed in bytes. If the address is not aligned an 
"access disallowed by virtual address" exception occurs. 

In an exemplary embodiment, Z (zero) rounding is not 
defined for unsigned extract operations, and a ReservedIn- 
struction exception is raised if attempted. F (floor) rounding 
will properly round unsigned results downward. 

As shown in FIG. 15C, an exemplary embodiment of a 
wide-multiply-matrix-extract-doublets instruction 
(W.MUL.MAT.X.B or W.MUL.MAT.X.L) multiplies 
memory [m63 m62 m61 . . . m2 m1 m0] with vector [h g f 
e d c b a], yielding the products 
[am7+bm15+cm23+dm31+em39+fm47+gm55+hm63 . . . 
am2+bm10+cm18+dm26+em34+fm42+gm50+hm58 
am1+bm9+cm17+dm25+em33+fm41+gm49+hm57 
am0+bm8+cm16+dm24+em32+fm40+gm48+hm56], 
rounded and limited as specified. 

As shown in FIG. 15D, an exemplary embodiment of a 
wide-multiply-matrix-extract-complex-doublets instruction 
(W.MUL.MAT.X with n set in rb) multiplies memory [m31 
m30 m29 . . . m2 m1 m0] with vector [h g f e d c b a], 
yielding the products 
[am7+bm6+cm15+dm14+em23+fm22+gm31+hm30 . . . 
am2-bm3+cm10-dm11+em18-fm19+gm26-hm27 
am1+bm0+cm9+dm8+em17+fm16+gm25+hm24 
am0-bm1+cm8-dm9+em16-fm17+gm24-hm25], 
rounded and limited as specified. 

An exemplary embodiment of the pseudocode 1580 of the 
Wide Multiply Matrix Extract instruction is shown in FIG. 
15E. An exemplary embodiment of the exceptions 1590 of 
the Wide Multiply Matrix Extract instruction is shown in 
FIG. 15F. 

Wide Multiply Matrix Extract Immediate 

An exemplary embodiment of the Wide Multiply Matrix 
Extract Immediate instruction is shown in FIGS. 16A-16E. 
In an exemplary embodiment, the Wide Multiply Matrix 
Extract Immediate instructions perform the same function as 
above, except that the extraction, operand format and size is 
controlled by fields in the instruction. This form encodes 
common forms of the above instruction without the need to 
initialize a register with the required control information. 
Controls within the instruction allow specification of signed, 
mixed signed, unsigned, and complex operands. 

In an exemplary embodiment, these instructions take an 
address from a general register to fetch a large operand from 
memory, a second operand from a general register, perform 
a group of operations on partitions of bits in the operands, 
and catenate the results together, placing the result in a 
general register. An exemplary embodiment of the format 
1610 of the Wide Multiply Matrix Extract Immediate 
instruction is shown in FIG. 16A. 

An exemplary embodiment of the schematics 1630 and 
1660 of the Wide Multiply Matrix Extract Immediate 
instruction is shown in FIGS. 16B and 16C. In an exemplary 
embodiment, the contents of register rc is used as a virtual 
address, and a value of specified size is loaded from 
memory. A second value is the contents of register rb. The 
values are partitioned into groups of operands of the size 
specified and are multiplied and summed in columns, pro- 
ducing a group of sums. The group of sums is rounded, 
limited, and extracted as specified, yielding a group of 
results, each of which is the size specified. The group of 
results is catenated and placed in register rd. All results are 
signed, N (nearest) rounding is used, and all results are 
limited to maximum representable signed values. 

In an exemplary embodiment, the wide-multiply-extract- 
immediate-matrix instructions (W.MUL.MAT.X.I, 
W.MUL.MAT.X.I.C) perform a partitioned array multiply of 
up to 16384 bits, that is 128x128 bits. The width of the array 
can be limited to 128, 64, 32, or 16 bits, but not smaller than 
twice the group size, by adding one-half the desired size in 
bytes to the virtual address operand: 8, 4, 2, or 1. The array 
can be limited vertically to 128, 64, 32, or 16 bits, but not 
smaller than twice the group size, by adding one half the 
desired memory operand size in bytes to the virtual address 
operand. 

In an exemplary embodiment, the virtual address must 
either be aligned to 2048/gsize bytes (or 1024/gsize for 
W.MUL.MAT.X.I.C), or must be the sum of an aligned 
address and one-half of the size of the memory operand in 
bytes and/or one half of the size of the result in bytes. An 
aligned address must be an exact multiple of the size 
expressed in bytes. If the address is not valid an "access 
disallowed by virtual address" exception occurs. 

As shown in FIG. 16B, an exemplary embodiment of a 
wide-multiply-extract-immediate-matrix-doublets instruction 
(W.MUL.MAT.X.I.16) multiplies memory [m63 m62 
m61 . . . m2 m1 m0] with vector [h g f e d c b a], yielding 
the products 
[am7+bm15+cm23+dm31+em39+fm47+gm55+hm63 . . . 
am2+bm10+cm18+dm26+em34+fm42+gm50+hm58 
am1+bm9+cm17+dm25+em33+fm41+gm49+hm57 
am0+bm8+cm16+dm24+em32+fm40+gm48+hm56], 
rounded and limited as specified. 
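
The pattern of products in the real-valued doublets case of FIG. 16B can be illustrated with a short sketch. It models only the multiply and column sums, omitting the rounding, limiting and extraction steps; symbols are plain Python integers, the list index of the result corresponds to the little-endian symbol position, and the function name `wide_multiply_matrix` is a hypothetical one chosen for this illustration.

```python
def wide_multiply_matrix(m, v, width=8):
    """Multiply vector v with a matrix stored row by row in m.

    m: flat list where row i occupies m[width*i : width*(i+1)].
    v: source vector, v[0] = a, v[1] = b, and so on.
    Returns the column sums; element j is sum over i of v[i] * m[width*i + j].
    """
    if len(m) != width * len(v):
        raise ValueError("matrix must have one row per vector symbol")
    return [
        sum(v[i] * m[width * i + j] for i in range(len(v)))
        for j in range(width)
    ]
```

With 64 memory symbols and an 8-symbol vector, element 0 of the result is am0+bm8+ . . . +hm56 and element 7 is am7+bm15+ . . . +hm63, matching the products listed above.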

As shown in FIG. 16C, an exemplary embodiment of a 
wide-multiply-matrix-extract-immediate-complex-doublets 
instruction (W.MUL.MAT.X.I.C.16) multiplies memory 
[m31 m30 m29 . . . m2 m1 m0] with vector [h g f e d c b 
a], yielding the products 
[am7+bm6+cm15+dm14+em23+fm22+gm31+hm30 . . . 
am2-bm3+cm10-dm11+em18-fm19+gm26-hm27 
am1+bm0+cm9+dm8+em17+fm16+gm25+hm24 
am0-bm1+cm8-dm9+em16-fm17+gm24-hm25], 
rounded and limited as specified. 

An exemplary embodiment of the pseudocode 1680 of the 
Wide Multiply Matrix Extract Immediate instruction is 
shown in FIG. 16D. An exemplary embodiment of the 
exceptions 1590 of the Wide Multiply Matrix Extract Imme- 
diate instruction is shown in FIG. 16E. 

Wide Multiply Matrix Floating-point 

An exemplary embodiment of the Wide Multiply Matrix 
Floating-point instruction is shown in FIGS. 17A-17E. In an 
exemplary embodiment, the Wide Multiply Matrix Floating- 
point instructions perform a matrix multiply in the same 
form as above, except that the multiplies and additions are 
performed in floating-point arithmetic. Sizes of half (16-bit), 
single (32-bit), double (64-bit), and complex sizes of half, 
single and double can be specified within the instruction. 

In an exemplary embodiment, these instructions take an
address from a general register to fetch a large operand from
memory, a second operand from a general register, perform
a group of operations on partitions of bits in the operands,
and catenate the results together, placing the result in a
general register. An exemplary embodiment of the format
1710 of the Wide Multiply Matrix Floating-point instruction
is shown in FIG. 17A.

An exemplary embodiment of the schematics 1730 and
1760 of the Wide Multiply Matrix Floating-point instruction
is shown in FIGS. 17B and 17C. In an exemplary
embodiment, the contents of register rc is used as a virtual
address, and a value of specified size is loaded from
memory. A second value is the contents of register rb. The
values are partitioned into groups of operands of the size
specified. The second values are multiplied with the first
values, then summed, producing a group of result values.
The group of result values is catenated and placed in register
rd.

In an exemplary embodiment, the wide-multiply-matrix-
floating-point instructions (W.MUL.MAT.F,
W.MUL.MAT.C.F) perform a partitioned array multiply of
up to 16384 bits, that is 128x128 bits. The width of the array
can be limited to 128, 64, or 32 bits, but not smaller than twice
the group size, by adding one-half the desired size in bytes
to the virtual address operand: 8, 4, or 2. The array can be
limited vertically to 128, 64, 32, or 16 bits, but not smaller
than twice the group size, by adding one-half the desired
memory operand size in bytes to the virtual address operand.

In an exemplary embodiment, the virtual address must 
either be aligned to 2048/gsize bytes (or 1024/gsize for 
W.MUL.MAT.C.F), or must be the sum of an aligned 
address and one half of the size of the memory operand in 
bytes and/or one-half of the size of the result in bytes. An 
aligned address must be an exact multiple of the size 
expressed in bytes. If the address is not valid an "access 
disallowed by virtual address" exception occurs. 

As shown in FIG. 17B, an exemplary embodiment of a
wide-multiply-matrix-floating-point-half instruction
(W.MUL.MAT.F) multiplies memory [m31 m30 . . . m1 m0]
with vector [h g f e d c b a], yielding products
[hm31+gm27+ . . . +bm7+am3 . . . hm28+gm24+ . . . +bm4+am0].

As shown in FIG. 17C, an exemplary embodiment of a
wide-multiply-matrix-complex-floating-point-half instruction
(W.MUL.MAT.F) multiplies memory [m15 m14 . . . m1
m0] with vector [h g f e d c b a], yielding products
[hm14+gm15+ . . . +bm2+am3 . . . hm12+gm13+ . . .
+bm0+am1 -hm13+gm12+ . . . -bm1+am0].

An exemplary embodiment of the pseudocode 1780 of the
Wide Multiply Matrix Floating-point instruction is shown in
FIG. 17D. Additional pseudocode functions used by this and
other floating point instructions are shown in FIG. FLOAT-1.
An exemplary embodiment of the exceptions 1790 of the
Wide Multiply Matrix Floating-point instruction is shown in
FIG. 17E.

Wide Multiply Matrix Galois 

An exemplary embodiment of the Wide Multiply Matrix 
Galois instruction is shown in FIGS. 18A-18D. In an 
exemplary embodiment, the Wide Multiply Matrix Galois 
instructions perform a matrix multiply in the same form as 
above, except that the multiplies and additions are performed
in Galois field arithmetic. A size of 8 bits can be specified 
within the instruction. The contents of a general register 
specify the polynomial with which to perform the Galois 
field remainder operation. The nature of the matrix multi- 
plication is novel and described in detail below. 

In an exemplary embodiment, these instructions take an
address from a general register to fetch a large operand from
memory, second and third operands from general registers,
perform a group of operations on partitions of bits in the
operands, and catenate the results together, placing the result
in a general register. An exemplary embodiment of the
format 1810 of the Wide Multiply Matrix Galois instruction
is shown in FIG. 18A.

An exemplary embodiment of the schematic 1830 of the 
Wide Multiply Matrix Galois instruction is shown in FIG. 
18B. In an exemplary embodiment, the contents of register
rc is used as a virtual address, and a value of specified size
is loaded from memory. Second and third values are the
contents of registers rd and rb. The values are partitioned
into groups of operands of the size specified. The second 
values are multiplied as polynomials with the first value, 
producing a result which is reduced to the Galois field 
specified by the third value, producing a group of result 
values. The group of result values is catenated and placed in 
register ra. 

In an exemplary embodiment, the wide-multiply-matrix- 
Galois-bytes instruction (W.MUL.MAT.G.8) performs a par- 
titioned array multiply of up to 16384 bits, that is 128x128 
bits. The width of the array can be limited to 128, 64, 32, or 
16 bits, but not smaller than twice the group size of 8 bits, 
by adding one-half the desired size in bytes to the virtual 
address operand: 8, 4, 2, or 1. The array can be limited 
vertically to 128, 64, 32, or 16 bits, but not smaller than 
twice the group size of 8 bits, by adding one-half the desired 
memory operand size in bytes to the virtual address operand.

In an exemplary embodiment, the virtual address must 
either be aligned to 256 bytes, or must be the sum of an 
aligned address and one-half of the size of the memory 
operand in bytes and/or one-half of the size of the result in 
bytes. An aligned address must be an exact multiple of the 
size expressed in bytes. If the address is not valid an "access 
disallowed by virtual address" exception occurs. 

As shown in FIG. 18B, an exemplary embodiment of a
wide-multiply-matrix-Galois-bytes instruction
(W.MUL.MAT.G.8) multiplies memory [m255 m254 . . . m1
m0] with vector [p o n m l k j i h g f e d c b a], reducing the
result modulo polynomial [q], yielding products
[(pm255+om247+ . . . +bm31+am15 mod q)
(pm254+om246+ . . . +bm30+am14 mod q) . . .
(pm248+om240+ . . . +bm16+am0 mod q)].

An exemplary embodiment of the pseudocode 1860 of the 
Wide Multiply Matrix Galois instruction is shown in FIG. 
18C. An exemplary embodiment of the exceptions 1890 of 
the Wide Multiply Matrix Galois instruction is shown in 
FIG. 18D. 

Memory Operands of Either Little-endian or Big- 
endian Conventional Byte Ordering 

In another aspect of the invention, memory operands of 
either little-endian or big-endian conventional byte ordering
are facilitated. Consequently, all Wide operand instructions
are specified in two forms, one for little-endian byte ordering
and one for big-endian byte ordering, as specified by a
portion of the instruction. The byte order specifies to the
memory system the order in which to deliver the bytes 
within units of the data path width (128 bits), as well as the 
order to place multiple memory words (128 bits) within a 
larger Wide operand. 

Extraction of a High Order Portion of a Multiplier 
Product or Sum of Products 

Another aspect of the present invention addresses extraction
of a high order portion of a multiplier product or sum
of products, as a way of efficiently utilizing a large multiplier
array. Related U.S. Pat. No. 5,742,840 and U.S. Pat. No.
5,953,241 describe a system and method for enhancing the
utilization of a multiplier array by adding specific classes of
instructions to a general-purpose processor. This addresses
the problem of making the most use of a large multiplier
array that is fully used for high-precision arithmetic (for
example, a 64x64 bit multiplier is fully used by a 64-bit by
64-bit multiply, but only one quarter used for a 32-bit by
32-bit multiply) for (relative to the multiplier data width and
registers) low-precision arithmetic operations. In particular,
operations that perform a great many low-precision multiplies
which are combined (added) together in various ways
are specified. One of the overriding considerations in selecting
the set of operations is a limitation on the size of the
result operand. In an exemplary embodiment, for example,
this size might be limited to on the order of 128 bits, or a
single register, although no specific size limitation need
exist.

The size of a multiply result, a product, is generally the
sum of the sizes of the operands, multiplicands and multiplier.
Consequently, multiply instructions specify operations
in which the size of the result is twice the size of identically-
sized input operands. For our prior art design, for example,
a multiply instruction accepted two 64-bit register sources
and produced a single 128-bit register-pair result, using an
entire 64x64 multiplier array for 64-bit symbols, or half the
multiplier array for pairs of 32-bit symbols, or one quarter
the multiplier array for quads of 16-bit symbols. For all of
these cases, note that two register sources of 64 bits are
combined, yielding a 128-bit result.

In several of the operations, including complex multiplies,
convolve, and matrix multiplication, low-precision multiplier
products are added together. The additions further
increase the required precision. The sum of two products
requires one additional bit of precision; adding four products
requires two, adding eight products requires three, adding
sixteen products requires four. In some prior designs, some
of this precision is lost, requiring scaling of the multiplier
operands to avoid overflow, further reducing accuracy of the
result.
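
The precision growth described above can be checked with a short calculation. The sketch below is illustrative only; the function name sum_of_products_bits is an assumption, and it simply counts the bits of a full-width product plus the extra bits needed for the additions.

```python
def sum_of_products_bits(operand_bits, num_products):
    """Bits needed to hold a sum of num_products full-width products."""
    product_bits = 2 * operand_bits               # product of two equal-size operands
    growth = (num_products - 1).bit_length()      # ceil(log2(num_products))
    return product_bits + growth


# 16-bit operands: one product, then sums of 2, 4, 8 and 16 products.
widths = [sum_of_products_bits(16, k) for k in (1, 2, 4, 8, 16)]
```

For 16-bit operands the list works out to 32, 33, 34, 35 and 36 bits, matching the one, two, three and four additional bits stated in the text.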

The use of register pairs creates an undesirable
complexity, in that both the register pair and individual
register values must be bypassed to subsequent instructions.
As a result, with prior art techniques only half of the source
operand 128-bit register values could be employed toward
producing a single-register 128-bit result.

In the present invention, a high-order portion of the
multiplier product or sum of products is extracted, adjusted
by a dynamic shift amount from a general register or an
adjustment specified as part of the instruction, and rounded
by a control value from a register or instruction portion as
round-to-nearest/even, toward zero, floor, or ceiling. Overflows
are handled by limiting the result to the largest and
smallest values that can be accurately represented in the
output result.
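
The shift, round and limit sequence described above can be modeled on ordinary integers. The following Python sketch is an assumption-laden illustration, not the instruction's pseudocode: the names round_shift and extract are hypothetical, only signed results are modeled, and the mode letters F, Z, N and C stand for floor, toward zero, round-to-nearest/even and ceiling.

```python
def round_shift(value, shift, mode):
    """Shift value right by shift bits, rounding per mode (F, Z, N or C)."""
    if shift == 0:
        return value
    floor = value >> shift             # Python's >> rounds toward -infinity
    rem = value - (floor << shift)     # discarded bits, 0 <= rem < 2**shift
    if rem == 0 or mode == "F":
        return floor
    if mode == "C":
        return floor + 1
    if mode == "Z":
        return floor + 1 if value < 0 else floor
    if mode == "N":
        half = 1 << (shift - 1)
        round_up = rem > half or (rem == half and floor & 1)
        return floor + 1 if round_up else floor
    raise ValueError("unknown rounding mode: " + mode)


def extract(value, shift, size, mode="N"):
    """Round a wide signed value, then limit (saturate) it to size bits."""
    lo, hi = -(1 << (size - 1)), (1 << (size - 1)) - 1
    return max(lo, min(hi, round_shift(value, shift, mode)))
```

For example, extract(1000, 2, 8) rounds 1000/4 to 250 and then limits it to 127, the largest signed 8-bit value, rather than wrapping around.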

Extract Controlled by a Register 

In the present invention, when the extract is controlled by
a register, the size of the result can be specified, allowing 
rounding and limiting to a smaller number of bits than can 
fit in the result. This permits the result to be scaled to be used 
in subsequent operations without concern of overflow or 
rounding, enhancing performance. 

Also in the present invention, when the extract is controlled
by a register, a single register value defines the size
of the operands, the shift amount and size of the result, and
the rounding control. By placing all this control information
in a single register, the size of the instruction is reduced over
the number of bits that such an instruction would otherwise
require, improving performance and enhancing flexibility of
the processor.

The particular instructions included in this aspect of the 
present invention are Ensemble Convolve Extract, Ensemble
Multiply Extract, Ensemble Multiply Add Extract and 
Ensemble Scale Add Extract. 

Ensemble Extract Inplace 

An exemplary embodiment of the Ensemble Extract 
Inplace instruction is shown in FIGS. 19A-19G. In an 
exemplary embodiment, several of these instructions 
(Ensemble Convolve Extract, Ensemble Multiply Add 
Extract) are typically available only in forms where the 
extract is specified as part of the instruction. An alternative 
embodiment can incorporate forms of the operations in 
which the size of the operand, the shift amount and the 
rounding can be controlled by the contents of a general 
register (as they are in the Ensemble Multiply Extract 
instruction). The definition of this kind of instruction for 
Ensemble Convolve Extract, and Ensemble Multiply Add 
Extract would require four source registers, which increases 
complexity by requiring additional general-register read
ports.

In an exemplary embodiment, these operations take oper- 
ands from four registers, perform operations on partitions of 
bits in the operands, and place the concatenated results in a 
fourth register. An exemplary embodiment of the format and 
operation codes 1910 of the Ensemble Extract Inplace 
instruction is shown in FIG. 19A. 

An exemplary embodiment of the schematics 1930, 1945, 
1960, and 1975 of the Ensemble Extract Inplace instruction 
is shown in FIGS. 19C, 19D, 19E, and 19F. In an exemplary 
embodiment, the contents of registers rd, rc, rb, and ra are
fetched. The specified operation is performed on these
operands. The result is placed into register rd. 

In an exemplary embodiment, for the E.CON.X 
instruction, the contents of registers rd and rc are catenated, 
as c || d, and used as a first value. A second value is the
contents of register rb. The values are partitioned into groups 
of operands of the size specified and are convolved, pro- 
ducing a group of values. The group of values is rounded, 
limited and extracted as specified, yielding a group of results 
that is the size specified. The group of results is catenated 
and placed in register rd. 

In an exemplary embodiment, for the E.MUL.ADD.X 
instruction, the contents of registers rc and rb are partitioned 
into groups of operands of the size specified and are 
multiplied, producing a group of values to which are added
the partitioned and extended contents of register rd. The 
group of values is rounded, limited and extracted as 
specified, yielding a group of results that is the size speci- 
fied. The group of results is catenated and placed in register 
rd. 

As shown in FIG. 19B, in an exemplary embodiment, bits 
31 ... 0 of the contents of register ra specifies several 
parameters that control the manner in which data is 
extracted, and for certain operations, the manner in which 
the operation is performed. The position of the control fields 
allows for the source position to be added to a fixed control 
value for dynamic computation, and allows for the lower 16 
bits of the control field to be set for some of the simpler 
extract cases by a single GCOPYI.128 instruction. The
control fields are further arranged so that if only the low
order 8 bits are non-zero, a 128-bit extraction with truncation
and no rounding is performed.

In an exemplary embodiment, the table below describes
the meaning of each label:

label   bits   meaning
fsize   8      field size
dpos    8      destination position
x       1      extended vs. group size result
s       1      signed vs. unsigned
n       1      complex vs. real multiplication
m       1      mixed-sign vs. same-sign multiplication
l       1      limit: saturation vs. truncation
rnd     2      rounding
gssp    9      group size and source position

In an exemplary embodiment, the 9-bit gssp field encodes
both the group size, gsize, and source position, spos, according
to the formula gssp=512-4*gsize+spos. The group size,
gsize, is a power of two in the range 1 . . . 128. The source
position, spos, is in the range 0 . . . (2*gsize)-1.

In an exemplary embodiment, the values in the x, s, n, m,
l, and rnd fields have the following meaning:

values  x         s         n        m           l         rnd
0       group     unsigned  real     same-sign   truncate  F
1       extended  signed    complex  mixed-sign  saturate  Z
2                                                          N
3                                                          C

Ensemble Multiply Add Extract 

As shown in FIG. 19C, an exemplary embodiment of an
ensemble-multiply-add-extract-doublets instruction
(E.MUL.ADD.X) multiplies vector rc [h g f e d c b a] with
vector rb [p o n m l k j i], and adds vector rd [x w v u t
s r q], yielding the result vector rd [hp+x go+w fn+v em+u
dl+t ck+s bj+r ai+q], rounded and limited as specified by
ra31..0.

As shown in FIG. 19D, an exemplary embodiment of an
ensemble-multiply-add-extract-doublets-complex instruction
(E.MUL.X with n set) multiplies operand vector rc [h g
f e d c b a] by operand vector rb [p o n m l k j i], yielding
the result vector rd [gp+ho go-hp en+fm em-fn cl+dk ck-dl
aj+bi ai-bj], rounded and limited as specified by ra31..0.
Note that this instruction prefers an organization of complex
numbers in which the real part is located to the right (lower
precision) of the imaginary part.

Ensemble Convolve Extract

As shown in FIG. 19E, an exemplary embodiment of an
ensemble-convolve-extract-doublets instruction (E.CON.X
with n=0) convolves vector rc || rd [x w v u t s r q p o n m
l k j i] with vector rb [h g f e d c b a], yielding the products
vector rd

[ax+bw+cv+du+et+fs+gr+hq . . . as+br+cq+dp+eo+fn+
gm+hl ar+bq+cp+do+en+fm+gl+hk aq+bp+co+dn+
em+fl+gk+hj], rounded and limited as specified by
ra31..0.
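
The term pattern of the real convolve described for FIG. 19E can be written as a single indexed sum. The Python sketch below is a minimal model under stated assumptions: the function name convolve_extract is hypothetical, element 0 of each list is the rightmost symbol, and rounding and limiting are omitted.

```python
def convolve_extract(cat, rb):
    """Model the real convolve on unbounded integers.

    cat: 16 symbols of the catenated source, cat[0] = 'i' ... cat[15] = 'x'.
    rb:  8 symbols, rb[0] = 'a' ... rb[7] = 'h'.
    """
    n = len(rb)
    assert len(cat) == 2 * n
    # result[i] = a*cat[i+8] + b*cat[i+7] + ... + h*cat[i+1]
    return [sum(rb[t] * cat[i + n - t] for t in range(n)) for i in range(n)]


cat = list(range(16))          # stand-in values for i..x
ones = convolve_extract(cat, [1] * 8)
```

With this indexing, result[0] is aq+bp+co+dn+em+fl+gk+hj and result[7] is ax+bw+cv+du+et+fs+gr+hq, as in the text.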

As shown in FIG. 19F, an exemplary embodiment of an
ensemble-convolve-extract-complex-doublets instruction
(E.CON.X with n=1) convolves vector rd || rc [x w v u t s r q
p o n m l k j i] with vector rb [h g f e d c b a], yielding the
products vector rd

[ax+bw+cv+du+et+fs+gr+hq . . . as-bt+cq-dr+eo-fp+
gm-hn ar+bq+cp+do+en+fm+gl+hk aq-br+co-dp+
em-fn+gk+hl], rounded and limited as specified by
ra31..0.

An exemplary embodiment of the pseudocode 1990 of 
Ensemble Extract Inplace instruction is shown in FIG. 19G.
In an exemplary embodiment, there are no exceptions for the 
Ensemble Extract Inplace instruction. 

Ensemble Extract 

An exemplary embodiment of the Ensemble Extract
instruction is shown in FIGS. 20A-20J. In an exemplary
embodiment, these operations take operands from three
registers, perform operations on partitions of bits in the
operands, and place the catenated results in a fourth register.
An exemplary embodiment of the format and operation
codes 2010 of the Ensemble Extract instruction is shown in
FIG. 20A.

An exemplary embodiment of the schematics 2020, 2030, 
2040, 2050, 2060, 2070, and 2080 of the Ensemble Extract 
Inplace instruction is shown in FIGS. 20C, 20D, 20E, 20F, 
20G, 20H, and 20I. In an exemplary embodiment, the
contents of registers rd, rc, and rb are fetched. The specified
operation is performed on these operands. The result is
placed into register ra.

As shown in FIG. 20B, in an exemplary embodiment, bits 
31 ... 0 of the contents of register rb specifies several 
parameters that control the manner in which data is 
extracted, and for certain operations, the manner in which 
the operation is performed. The position of the control fields 
allows for the source position to be added to a fixed control
value for dynamic computation, and allows for the lower 16
bits of the control field to be set for some of the simpler
extract cases by a single GCOPYI.128 instruction. The
control fields are further arranged so that if only the low 
order 8 bits are non-zero, a 128-bit extraction with trunca- 
tion and no rounding is performed. 

In an exemplary embodiment, the table below describes
the meaning of each label:

label   bits   meaning
fsize   8      field size
dpos    8      destination position
x       1      extended vs. group size result
s       1      signed vs. unsigned
n       1      complex vs. real multiplication
m       1      merge vs. extract or mixed-sign vs.
               same-sign multiplication
l       1      limit: saturation vs. truncation
rnd     2      rounding
gssp    9      group size and source position


In an exemplary embodiment, the 9-bit gssp field encodes
both the group size, gsize, and source position, spos, according
to the formula gssp=512-4*gsize+spos. The group size,
gsize, is a power of two in the range 1 . . . 128. The source
position, spos, is in the range 0 . . . (2*gsize)-1.
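
Because each group size owns a disjoint range of gssp values, the formula can be inverted without a separate size field. The Python sketch below illustrates this; the function names encode_gssp and decode_gssp are hypothetical, and the search-based decoder is one possible approach rather than the hardware decoding.

```python
GROUP_SIZES = (128, 64, 32, 16, 8, 4, 2, 1)


def encode_gssp(gsize, spos):
    """Pack group size and source position as gssp = 512 - 4*gsize + spos."""
    if gsize not in GROUP_SIZES:
        raise ValueError("gsize must be a power of two in 1..128")
    if not 0 <= spos < 2 * gsize:
        raise ValueError("spos must be in 0..(2*gsize)-1")
    return 512 - 4 * gsize + spos


def decode_gssp(gssp):
    """Recover (gsize, spos); each gsize owns a disjoint range of gssp."""
    for gsize in GROUP_SIZES:
        spos = gssp - (512 - 4 * gsize)
        if 0 <= spos < 2 * gsize:
            return gsize, spos
    raise ValueError("gssp value does not encode a valid pair")
```

For instance, gsize=16 with spos=0 encodes as 448, and gsize=128 occupies the values 0 through 255. Under this formula the 9-bit values 510 and 511 do not correspond to any valid pair.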

In an exemplary embodiment, the values in the x, s, n, m,
l, and rnd fields have the following meaning:

values  x         s         n        m                  l         rnd
0       group     unsigned  real     extract/same-sign  truncate  F
1       extended  signed    complex  merge/mixed-sign   saturate  Z
2                                                                 N
3                                                                 C

In an exemplary embodiment, for the E.SCAL.ADD.X
instruction, bits 127..64 of the contents of register rb
specify the multipliers for the multiplicands in registers rd
and rc. Specifically, bits 64+2*gsize-1..64+gsize are the
multiplier for the contents of register rd, and bits
64+gsize-1..64 are the multiplier for the contents of
register rc.

Ensemble Multiply Extract 

As shown in FIG. 20C, an exemplary embodiment of an
ensemble-multiply-extract-doublets instruction (E.MUL.X)
multiplies vector rd [h g f e d c b a] with vector rc [p o n m
l k j i], yielding the result vector ra [hp go fn em dl ck bj ai],
rounded and limited as specified by rb31..0.

As shown in FIG. 20D, an exemplary embodiment of an
ensemble-multiply-extract-doublets-complex instruction
(E.MUL.X with n set) multiplies vector rd [h g f e d c b a]
by vector rc [p o n m l k j i], yielding the result vector ra
[gp+ho go-hp en+fm em-fn cl+dk ck-dl aj+bi ai-bj],
rounded and limited as specified by rb31..0. Note that this
instruction prefers an organization of complex numbers in
which the real part is located to the right (lower precision)
of the imaginary part.
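
The complex result pattern above follows directly from placing the real part to the right of the imaginary part in each pair. The Python sketch below illustrates that layout; the function name complex_multiply is hypothetical, element 0 of each list is the rightmost symbol, and rounding and limiting are omitted.

```python
def complex_multiply(rd, rc):
    """Multiply packed complex pairs; even index = real, odd = imaginary."""
    out = []
    for k in range(0, len(rd), 2):
        a_re, a_im = rd[k], rd[k + 1]      # e.g. a (real), b (imaginary)
        b_re, b_im = rc[k], rc[k + 1]      # e.g. i (real), j (imaginary)
        out.append(a_re * b_re - a_im * b_im)   # real part, e.g. ai-bj
        out.append(a_re * b_im + a_im * b_re)   # imaginary part, e.g. aj+bi
    return out


pair = complex_multiply([1, 2], [3, 4])    # (1+2i) * (3+4i)
```

Here pair holds the real part followed by the imaginary part of (1+2i)(3+4i), that is -5 and 10.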

Ensemble Scale Add Extract 

An aspect of the present invention defines the Ensemble
Scale Add Extract instruction, that combines the extract
control information in a register along with two values that
are used as scalar multipliers to the contents of two vector
multiplicands.

This combination reduces the number of registers that
would otherwise be required, or the number of bits that the
instruction would otherwise require, improving performance.
Another advantage of the present invention is that
the combined operation may be performed by an exemplary
embodiment with sufficient internal precision on the summation
node that no intermediate rounding or overflow
occurs, improving the accuracy over prior art operation in
which more than one instruction is required to perform this
computation.

As shown in FIG. 20E, an exemplary embodiment of an
ensemble-scale-add-extract-doublets instruction
(E.SCAL.ADD.X) multiplies vector rd [h g f e d c b a] with
rb95..80 [r] and adds the product to the product of vector
rc [p o n m l k j i] with rb79..64 [q], yielding the result
[hr+pq gr+oq fr+nq er+mq dr+lq cr+kq br+jq ar+iq],
rounded and limited as specified by rb31..0.

As shown in FIG. 20F, an exemplary embodiment of an
ensemble-scale-add-extract-doublets-complex instruction
(E.SCAL.ADD.X with n set) multiplies vector rd [h g f e d c
b a] with rb127..96 [t s] and adds the product to the product
of vector rc [p o n m l k j i] with rb95..64 [r q], yielding the
result [hs+gt+pq+or gs-ht+oq-pr fs+et+nq+mr es-ft+mq-nr
ds+ct+lq+kr cs-dt+kq-lr bs+at+jq+ir as-bt+iq-jr],
rounded and limited as specified by rb31..0.

Ensemble Extract

As shown in FIG. 20G, in an exemplary embodiment, for
the E.EXTRACT instruction, when m=0 and x=0, the
parameters specified by the contents of register rb are
interpreted to select fields from double size symbols of the
catenated contents of registers rd and rc, extracting values
which are catenated and placed in register ra.

As shown in FIG. 20H, in an exemplary embodiment, for
an ensemble-merge-extract (E.EXTRACT when m=1), the
parameters specified by the contents of register rb are
interpreted to merge fields from symbols of the contents of
register rd with the contents of register rc. The results are
catenated and placed in register ra. The x field has no effect
when m=1.

As shown in FIG. 20I, in an exemplary embodiment, for
an ensemble-expand-extract (E.EXTRACT when m=0 and
x=1), the parameters specified by the contents of register rb
are interpreted to extract fields from symbols of the contents
of register rd. The results are catenated and placed in register
ra. Note that the value of rc is not used.

An exemplary embodiment of the pseudocode 2090 of
Ensemble Extract instruction is shown in FIG. 20J. In an
exemplary embodiment, there are no exceptions for the
Ensemble Extract instruction.

Reduction of Register Read Ports 

Another alternative embodiment can reduce the number
of register read ports required for implementation of instruc- 
tions in which the size, shift and rounding of operands is 
controlled by a register. The value of the extract control 
register can be fetched using an additional cycle on an initial 
execution and retained within or near the functional unit for 
subsequent executions, thus reducing the amount of hard- 
ware required for implementation with a small additional 
performance penalty. The value retained would be marked 
invalid, causing a re-fetch of the extract control register, by 
instructions that modify the register or alternatively, the 
retained value can be updated by such an operation. A 
re-fetch of the extract control register would also be required 
if a different register number were specified on a subsequent 
execution. It should be clear that the properties of the above 
two alternative embodiments can be combined. 
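
The retained control value behaves much like a one-entry cache keyed by register number. The Python sketch below is a hypothetical behavioral model only: the class name ExtractControlCache, its methods, and the dictionary used as a register file are all assumptions, and it models the invalidate-on-write variant rather than the update-on-write alternative.

```python
class ExtractControlCache:
    """Hypothetical model of a functional unit retaining the control value."""

    def __init__(self):
        self.reg_num = None    # register number of the retained value
        self.value = None      # retained extract control value

    def read(self, reg_num, regfile):
        """Return (control value, extra cycles spent fetching it)."""
        if self.reg_num == reg_num:
            return self.value, 0             # reuse the retained value
        self.reg_num, self.value = reg_num, regfile[reg_num]
        return self.value, 1                 # one additional fetch cycle

    def on_register_write(self, reg_num):
        """Mark the retained value invalid when its register is modified."""
        if self.reg_num == reg_num:
            self.reg_num = self.value = None
```

In this model the first execution and any execution after a write to the control register, or with a different register number, pay one extra cycle; repeated executions with the same unmodified register pay none.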

Galois Field Arithmetic 

Another aspect of the invention includes Galois field 
arithmetic, where multiplies are performed by an initial 
binary polynomial multiplication (unsigned binary multipli- 
cation with carries suppressed), followed by a polynomial 
modulo/remainder operation (unsigned binary division with 
carries suppressed). The remainder operation is relatively 
expensive in area and delay. In Galois field arithmetic, 
additions are performed by binary addition with carries 
suppressed, or equivalently, a bitwise exclusive or operation. 
In this aspect of the present invention, a matrix multiplica- 
tion is performed using Galois field arithmetic, where the 
multiplies and additions are Galois field multiplies and
additions.

Using prior art methods, a 16 byte vector multiplied by a 
16x16 byte matrix can be performed as 256 8-bit Galois field
multiplies and 16*15=240 8-bit Galois field additions. 
Included in the 256 Galois field multiplies are 256 polyno- 
mial multiplies and 256 polynomial remainder operations. 

By use of the present invention, the total computation is
reduced significantly by performing 256 polynomial
multiplies, 240 16-bit polynomial additions, and 16 polynomial
remainder operations. Note that the cost of the
polynomial additions has been doubled compared with the
Galois field additions, as these are now 16-bit operations
rather than 8-bit operations, but the cost of the polynomial
remainder functions has been reduced by a factor of 16.
Overall, this is a favorable tradeoff, as the cost of addition
is much lower than the cost of remainder.
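
The saving relies on the remainder operation being linear over exclusive-or, so the unreduced products can be summed first and reduced once. The Python sketch below compares the two orderings. The function names are hypothetical, the matrix is indexed as mat[i][j] purely for illustration, and the polynomial is assumed to be passed as a full 9-bit value such as 0x11B; how the register actually encodes the polynomial is not modeled.

```python
def clmul(a, b):
    """Binary polynomial multiply (multiplication with carries suppressed)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(x, poly):
    """Remainder of x divided by poly, with carries suppressed."""
    degree = poly.bit_length() - 1
    while x.bit_length() > degree:
        x ^= poly << (x.bit_length() - 1 - degree)
    return x


def matvec_prior_art(vec, mat, poly):
    """One remainder per multiply: 256 remainders for a 16x16 matrix."""
    out = []
    for j in range(len(mat[0])):
        acc = 0
        for i, v in enumerate(vec):
            acc ^= poly_mod(clmul(v, mat[i][j]), poly)   # 8-bit addition
        out.append(acc)
    return out


def matvec_deferred(vec, mat, poly):
    """XOR the unreduced products, then one remainder per result byte."""
    out = []
    for j in range(len(mat[0])):
        acc = 0
        for i, v in enumerate(vec):
            acc ^= clmul(v, mat[i][j])     # unreduced sum fits in 16 bits
        out.append(poly_mod(acc, poly))    # 16 remainders in total
    return out
```

Both functions return the same 16 result bytes for any inputs; the deferred form simply performs the remainder 16 times instead of 256.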

Decoupled Access From Execution Pipelines and 
Simultaneous Multithreading 

In yet another aspect of the present invention, best shown
in FIG. 4, the present invention employs both decoupled
access from execution pipelines and simultaneous multithreading
in a unique way. Simultaneous Multithreaded
pipelines have been employed in prior art to enhance the
utilization of data path units by allowing instructions to be
issued from one of several execution threads to each functional
unit (e.g. Dean M. Tullsen, Susan J. Eggers, and
Henry M. Levy, "Simultaneous Multithreading: Maximizing
On Chip Parallelism," Proceedings of the 22nd Annual
International Symposium on Computer Architecture, Santa
Margherita Ligure, Italy, June, 1995).

Decoupled access from execution pipelines have been
employed in prior art to enhance the utilization of execution
data path units by buffering results from an access unit,
which computes addresses to a memory unit that in turn
fetches the requested items from memory, and then presenting
them to an execution unit (e.g. J. E. Smith, "Decoupled
Access/Execute Computer Architectures", Proceedings of
the Ninth Annual International Symposium on Computer
Architecture, Austin, Tex. (Apr. 26-29, 1982), pp. 112-119).

Compared to conventional pipelines, the Eggers prior art 
used an additional pipeline cycle before instructions could 
be issued to functional units, the additional cycle needed to 
determine which threads should be permitted to issue 
instructions. Consequently, relative to conventional
pipelines, the prior art design had additional delay, including 
dependent branch delay. 

The present invention contains individual access data path
units, with associated register files, for each execution
thread. These access units produce addresses, which are
aggregated together to a common memory unit, which
fetches all the addresses and places the memory contents in
one or more buffers. Instructions for execution units, which
are shared to varying degrees among the threads, are also
buffered for later execution. The execution units then perform
operations from all active threads using functional data
path units that are shared.

For instructions performed by the execution units, the
extra cycle required for prior art simultaneous multithreading
designs is overlapped with the memory data access time
from prior art decoupled access from execution cycles, so
that no additional delay is incurred by the execution functional
units for scheduling resources. For instructions performed
by the access units, by employing individual access
units for each thread the additional cycle for scheduling
shared resources is also eliminated.

This is a favorable tradeoff because, while threads do not 
share the access functional units, these units are relatively 
small compared to the execution functional units, which are
shared by threads. 

With regard to the sharing of execution units, the present
invention employs several different classes of functional
units for the execution unit, with varying cost, utilization,
and performance. In particular, the G units, which perform
simple addition and bitwise operations, are relatively inexpensive
(in area and power) compared to the other units, and their
utilization is relatively high. Consequently, the design
employs four such units, where each unit can be shared
between two threads. The X unit, which performs a broad
class of data switching functions, is more expensive and less
used, so two units are provided that are each shared among
two threads. The T unit, which performs the Wide Translate
instruction, is expensive and utilization is low, so the single
unit is shared among all four threads. The E unit, which
performs the class of Ensemble instructions, is very expensive
in area and power compared to the other functional
units, but utilization is relatively high, so we provide two
such units, each unit shared by two threads.

In FIG. 4, four copies of an access unit are shown, each
with an access instruction fetch queue A-Queue 401-404,
coupled to an access register file AR 405-408, each of which
is, in turn, coupled to two access functional units A 409-416.
The access units function independently for four simultaneous
threads of execution. These eight access functional
units A 409-416 produce results for access register files AR
405-408 and addresses to a shared memory system 417. The
memory contents fetched from memory system 417 are
combined with execute instructions not performed by the
access unit and entered into the four execute instruction
queues E-Queue 421-424. Instructions and memory data
from E-queue 421-424 are presented to execution register
files 425-428, which fetch execution register file source
operands. The instructions are coupled to the execution unit
arbitration unit Arbitration 431, that selects which instructions
from the four threads are to be routed to the available
execution units E 441 and 449, X 442 and 448, G 443-444
and 446-447, and T 445. The execution register file source 
operands ER 425-428 are coupled to the execution units 
441-445 using source operand buses 451-454 and to the 
execution units 445-449 using source operand buses 
455-458. The function unit result operands from execution 
units 441-445 are coupled to the execution register file using 
result bus 461 and the function units result operands from 
execution units 445-449 are coupled to the execution reg- 
ister file using result bus 462. 
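
The routing performed by Arbitration 431 can be pictured with a small sketch. The unit numbers follow FIG. 4, but the assignment of particular threads to particular shared units and the fixed lowest-thread-first priority are assumptions made only for illustration; the text does not specify either.

```python
# Hypothetical sharing map: unit id -> threads allowed to use it.
# Unit ids come from FIG. 4; the thread pairings are assumed.
UNITS = {
    "E": {441: (0, 1), 449: (2, 3)},
    "X": {442: (0, 1), 448: (2, 3)},
    "G": {443: (0, 1), 444: (0, 1), 446: (2, 3), 447: (2, 3)},
    "T": {445: (0, 1, 2, 3)},
}


def arbitrate(requests):
    """Grant at most one thread per unit for a single cycle.

    requests maps a thread number to the unit class it needs.
    Returns a map from thread number to the granted unit id,
    or None when the thread has to wait for a later cycle.
    """
    busy = set()
    grants = {}
    for thread in sorted(requests):      # assumed priority: lowest thread first
        grants[thread] = None
        for unit, sharers in UNITS[requests[thread]].items():
            if thread in sharers and unit not in busy:
                busy.add(unit)
                grants[thread] = unit
                break
    return grants


print(arbitrate({0: "G", 1: "G", 2: "T", 3: "T"}))
```

Under these assumptions two threads of a pair can both issue G operations in the same cycle, while two simultaneous requests for the single T unit leave one thread waiting.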

Improved Interprivilege Gateway 

In a still further aspect of the present invention, an
improved interprivilege gateway is described which
involves increased parallelism and leads to enhanced
performance. In related U.S. patent application Ser. No.
08/541,416, a system and method is described for implementing an
instruction that, in a controlled fashion, allows the transfer
of control (branch) from a lower privilege level to a higher
privilege level. The present invention is an improved system
and method for a modified instruction that accomplishes the
same purpose but with specific advantages.

Many processor resources, such as control of the virtual
memory system itself, input and output operations, and
system control functions, are protected from accidental or
malicious misuse by enclosing them in a protective,
privileged region. Entry to this region must be established only
through particular entry points, called gateways, to maintain
the integrity of these protected regions.

Prior art versions of this operation generally load an
address from a region of memory using a protected virtual
memory attribute that is only set for data regions that contain
valid gateway entry points, then perform a branch to an
address contained in the contents of memory. Basically,
three steps were involved: load, then branch and check.
Compared to other instructions, such as register to register
computation instructions and memory loads and stores, and
register based branches, this is a substantially longer
operation, which introduces delays and complexity to a
pipelined implementation.






In the present invention, the branch-gateway instruction
performs two operations in parallel: 1) a branch is performed
to the contents of register 0 and 2) a load is performed using
the contents of register 1, using a specified byte order
(little-endian) and a specified size (64 bits). If the value
loaded from memory does not equal the contents of register
0, the instruction is aborted due to an exception. In addition,
3) a return address (the next sequential instruction address
following the branch-gateway instruction) is written into
register 0, provided the instruction is not aborted. This
approach essentially uses a first instruction to establish the
requisite permission to allow user code to access privileged
code, and then a second instruction is permitted to branch
directly to the privileged code because of the permissions
issued for the first instruction.

In the present invention, the new privilege level is also
contained in register 0, and the second parallel operation
does not need to be performed if the new privilege level is
not greater than the old privilege level. When this second
operation is suppressed, the remainder of the instruction
performs an identical function to a branch-link instruction,
which is used for invoking procedures that do not require an
increase in privilege. The advantage that this feature brings
is that the branch-gateway instruction can be used to call a
procedure that may or may not require an increase in
privilege.

The memory load operation verifies with the virtual 
memory system that the region that is loaded has been 
tagged as containing valid gateway data. A further advantage 
of the present invention is that the called procedure may rely 
on the fact that register 1 contains the address that the 
gateway data was loaded from, and can use the contents of 
register 1 to locate additional data or addresses that the 
procedure may require. Prior art versions of this instruction 
required that an additional address be loaded from the 
gateway region of memory in order to initialize that address 
in a protected manner — the present invention allows the 
address itself to be loaded with a "normal" load operation 
that does not require special protection. 

The present invention allows a "normal" load operation to
also load the contents of register 0 prior to issuing the
branch-gateway instruction. The value may be loaded from
the same memory address that is loaded by the branch-gateway
instruction, because the present invention contains
a virtual memory system in which the region may be enabled
for normal load operations as well as the special "gateway"
load operation performed by the branch-gateway instruction.






Improved Interprivilege Gateway — System and 
Privileged Library Calls 



An exemplary embodiment of the System and Privileged
Library Calls is shown in FIGS. 21A-21B. An exemplary
embodiment of the schematic 2110 of System and Privileged
Library Calls is shown in FIG. 21A. In an exemplary
embodiment, it is an objective to make calls to system
facilities and privileged libraries as similar as possible to
normal procedure calls as described above. Rather than
invoke system calls as an exception, which involves significant
latency and complication, a modified procedure call in
which the process privilege level is quietly raised to the
required level is used. To provide this mechanism safely,
interaction with the virtual memory system is required.

In an exemplary embodiment, such a procedure must not
be entered from anywhere other than its legitimate entry
point, to prohibit entering a procedure after the point at
which security checks are performed or with invalid register
contents, otherwise the access to a higher privilege level can
lead to a security violation. In addition, the procedure
generally must have access to memory data, for which
addresses must be produced by the privileged code. To
facilitate generating these addresses, the branch-gateway
instruction allows the privileged code procedure to rely on
the fact that a single register has been verified to contain a
pointer to a valid memory region.

In an exemplary embodiment, the branch-gateway
instruction ensures both that the procedure is invoked at a
proper entry point, and that other registers such as the data
pointer and stack pointer can be properly set. To ensure this,
the branch-gateway instruction retrieves a "gateway"
directly from the protected virtual memory space. The
gateway contains the virtual address of the entry point of the
procedure and the target privilege level. A gateway can only
exist in regions of the virtual address space designated to
contain them, and can only be used to access privilege levels
at or below the privilege level at which the memory region
can be written, to ensure that a gateway cannot be forged.

In an exemplary embodiment, the branch-gateway
instruction ensures that register 1 (dp) contains a valid
pointer to the gateway for this target code address by
comparing the contents of register 0 (lp) against the gateway
retrieved from memory and causing an exception trap if they
do not match. By ensuring that register 1 points to the
gateway, auxiliary information, such as the data pointer and
stack pointer, can be set by loading values located by the
contents of register 1. For example, the eight bytes following
the gateway may be used as a pointer to a data region for the
procedure.

In an exemplary embodiment, before executing the
branch-gateway instruction, register 1 must be set to point at
the gateway, and register 0 must be set to the address of the
target code address plus the desired privilege level. A
"L.I.64.L.A r0=r1,0" instruction is one way to set register 0,
if register 1 has already been set, but any means of getting
the correct value into register 0 is permissible.

In an exemplary embodiment, similarly, a return from a 
system or privileged routine involves a reduction of privi- 
lege. This need not be carefully controlled by architectural 
facilities, so a procedure may freely branch to a less- 
privileged code address. Normally, such a procedure restores 
the stack frame, then uses the branch-down instruction to 
return. 

An exemplary embodiment of the typical dynamic-linked,
inter-gateway calling sequence 2130 is shown in FIG. 21B.
In an exemplary embodiment, the calling sequence is identical
to that of the inter-module calling sequence shown
above, except for the use of the B.GATE instruction instead
of a B.LINK instruction. Indeed, if a B.GATE instruction is
used when the privilege level in the lp register is not higher
than the current privilege level, the B.GATE instruction
performs an identical function to a B.LINK.

In an exemplary embodiment, the callee, if it uses a stack
for local variable allocation, cannot necessarily trust the
value of the sp passed to it, as it can be forged. Similarly, any
pointers which the callee provides should not be used
directly unless they are verified to point to regions which
the callee should be permitted to address. This can be
avoided by defining application programming interfaces
(APIs) in which all values are passed and returned in
registers, or by using a trusted, intermediate privilege wrapper
routine to pass and return parameters. The method
described below can also be used.

In an exemplary embodiment, it can be useful to have
highly privileged code call less-privileged routines. For
example, a user may request that errors in a privileged
routine be reported by invoking a user-supplied error-logging
routine. To invoke the procedure, the privilege can
be reduced via the branch-down instruction. The return from
the procedure actually requires an increase in privilege,
which must be carefully controlled. This is dealt with by
placing the procedure call within a lower-privilege procedure
wrapper, which uses the branch-gateway instruction to
return to the higher privilege region after the call through a
secure re-entry point. Special care must be taken to ensure
that the less-privileged routine is not permitted to gain
unauthorized access by corruption of the stack or saved
registers, such as by saving all registers and setting up a new
stack frame (or restoring the original lower-privilege stack)
that may be manipulated by the less-privileged routine.
Finally, such a technique is vulnerable to an unprivileged
routine attempting to use the re-entry point directly, so it
may be appropriate to keep a privileged state variable which
controls permission to enter at the re-entry point.


Improved Interprivilege Gateway — Branch Gateway

An exemplary embodiment of the Branch Gateway
instruction is shown in FIGS. 21C-21F. In an exemplary
embodiment, this operation provides a secure means to call
a procedure, including those at a higher privilege level. An
exemplary embodiment of the format and operation codes
2160 of the Branch Gateway instruction is shown in FIG.
21C.

An exemplary embodiment of the schematic 2170 of the
Branch Gateway instruction is shown in FIG. 21D. In an
exemplary embodiment, the contents of register rb is a
branch address in the high-order 62 bits and a new privilege
level in the low-order 2 bits. A branch and link occurs to the
branch address, and the privilege level is raised to the new
privilege level. The high-order 62 bits of the successor to the
current program counter is catenated with the 2-bit current
execution privilege and placed in register 0.

In an exemplary embodiment, if the new privilege level is
greater than the current privilege level, an octlet of memory
data is fetched from the address specified by register 1, using
the little-endian byte order and a gateway access type. A
GatewayDisallowed exception occurs if the original contents
of register 0 do not equal the memory data.

In an exemplary embodiment, if the new privilege level is
the same as the current privilege level, no checking of
register 1 is performed.

In an exemplary embodiment, an AccessDisallowed
exception occurs if the new privilege level is greater than the
privilege level required to write the memory data, or if the
old privilege level is lower than the privilege required to
access the memory data as a gateway, or if the access is not
aligned on an 8-byte boundary.

In an exemplary embodiment, a ReservedInstruction
exception occurs if the rc field is not one or the rd field is not
zero.
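
The behavior described in the preceding paragraphs can be summarized in a minimal sketch. It is not the pseudocode of FIG. 21E. The function name, the dictionary standing in for gateway-tagged memory, the 4-byte instruction size, and the choice to leave the privilege level unchanged when it is not being raised are all assumptions for illustration, and the virtual memory privilege checks are reduced to a simple membership test.

```python
MASK64 = (1 << 64) - 1


class GatewayDisallowed(Exception):
    pass


class AccessDisallowed(Exception):
    pass


class ReservedInstruction(Exception):
    pass


def branch_gateway(regs, pc, priv, gateway_mem, rd=0, rc=1, rb=0):
    """Model one branch-gateway; returns (new_pc, new_priv), updates regs.

    gateway_mem is a hypothetical stand-in for octlets tagged as gateway
    data: it maps an aligned address to its 64-bit contents.
    """
    if rc != 1 or rd != 0:
        raise ReservedInstruction()
    target = regs[rb]                 # 62-bit branch address, 2-bit privilege
    new_priv = target & 0b11
    if new_priv > priv:
        pointer = regs[rc]
        if pointer % 8 != 0 or pointer not in gateway_mem:
            raise AccessDisallowed()
        if gateway_mem[pointer] != target:
            raise GatewayDisallowed()
        priv_after = new_priv
    else:
        priv_after = priv             # assumed: acts like a plain branch-link
    # Link value: successor address (assumed pc + 4) with the old privilege.
    regs[rd] = (((pc + 4) & MASK64) & ~0b11) | priv
    return target & ~0b11 & MASK64, priv_after


regs = [0] * 64
regs[0] = 0x2000 | 2                  # entry point 0x2000 at privilege 2
regs[1] = 0x1000                      # pointer to the gateway octlet
print(branch_gateway(regs, pc=0x100, priv=0, gateway_mem={0x1000: 0x2002}))
```

In this sketch a forged value in register 0 fails the comparison against the gateway octlet, so control never reaches an address that the gateway region does not name.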

In an exemplary embodiment, in the example in FIG.
21D, a gateway from level 0 to level 2 is illustrated. The
gateway pointer, located by the contents of register rc (1), is
fetched from memory and compared against the contents of
register rb (0). The instruction may only complete if these
values are equal. Concurrently, the contents of register rb (0)
is placed in the program counter and privilege level, and the
address of the next sequential instruction and privilege level is
placed into register rd (0). Code at the target of the gateway
locates the data pointer at an offset from the gateway pointer
(register 1), and fetches it into register 1, making a data
region available. A stack pointer may be saved and fetched
using the data region, another region located from the data
region, or a data region located as an offset from the original
gateway pointer.

In an exemplary embodiment, this instruction gives the
target procedure the assurances that register 0 contains a
valid return address and privilege level, that register 1 points
to the gateway location, and that the gateway location is
octlet aligned. Register 1 can then be used to securely reach
values in memory. If no sharing of literal pools is desired,
register 1 may be used as a literal pool pointer directly. If
sharing of literal pools is desired, register 1 may be used
with an appropriate offset to load a new literal pool pointer;
for example, with a one cache line offset from register 1.
Note that because the virtual memory system operates with
cache line granularity, several gateway locations must
be created together.

In an exemplary embodiment, software must ensure that
an attempt to use any octlet within the region designated by
virtual memory as gateway either functions properly or
causes a legitimate exception. For example, if the adjacent
octlets contain pointers to literal pool locations, software
should ensure that these literal pools are not executable, or
that by virtue of being aligned addresses, cannot raise the
execution privilege level. If register 1 is used directly as a
literal pool location, software must ensure that the literal
pool locations that are accessible as a gateway do not lead
to a security violation.

In an exemplary embodiment, register 0 contains a valid
return address and privilege level, the value is suitable for
use directly in the Branch down (B.DOWN) instruction to
return to the gateway callee.

An exemplary embodiment of the pseudocode 2190 of the
Branch Gateway instruction is shown in FIG. 21E. An
exemplary embodiment of the exceptions 2199 of the
Branch Gateway instruction is shown in FIG. 21F.

Group Add

In accordance with one embodiment of the invention, the
processor handles a variety of fix-point, or integer, group
operations. For example, FIG. 26A presents various
examples of Group Add instructions accommodating different
operand sizes, such as a byte (8 bits), doublet (16 bits),
quadlet (32 bits), octlet (64 bits), and hexlet (128 bits).
FIGS. 26B and 26C illustrate an exemplary embodiment of
a format and operation codes that can be used to perform the
various Group Add instructions shown in FIG. 26A. As
shown in FIGS. 26B and 26C, in this exemplary
embodiment, the contents of registers rc and rb are partitioned
into groups of operands of the size specified and
added, and if specified, checked for overflow or limited,
yielding a group of results, each of which is the size
specified. The group of results is catenated and placed in
register rd. While the use of two operand registers and a
different result register is described here and elsewhere in
the present specification, other arrangements, such as the use
of immediate values, may also be implemented.

In the present embodiment, for example, if the operand
size specified is a byte (8 bits), and each register is 128 bits
wide, then the content of each register may be partitioned
into 16 individual operands, and 16 different individual add
operations may take place as the result of a single Group
Add instruction. Other instructions involving groups of
operands may perform group operations in a similar fashion.

Group Set and Group Subtract

Similarly, FIG. 27A presents various examples of Group
Set instructions and Group Subtract instructions accommodating
different operand sizes. FIGS. 27B and 27C illustrate
an exemplary embodiment of a format and operation codes
that can be used to perform the various Group Set instructions
and Group Subtract instructions. As shown in FIGS.
27B and 27C, in this exemplary embodiment, the contents of
registers rc and rb are partitioned into groups of operands of
the size specified and for Group Set instructions are compared
for a specified arithmetic condition or for Group
Subtract instructions are subtracted, and if specified,
checked for overflow or limited, yielding a group of results,
each of which is the size specified. The group of results is
catenated and placed in register rd.

Ensemble Convolve, Divide, Multiply, Multiply Sum

In the present embodiment, other fix-point group operations
are also available. FIG. 28A presents various examples
of Ensemble Convolve, Ensemble Divide, Ensemble
Multiply, and Ensemble Multiply Sum instructions accommodating
different operand sizes. FIGS. 28B and 28C illustrate
an exemplary embodiment of a format and operation
codes that can be used to perform the various Ensemble
Convolve, Ensemble Divide, Ensemble Multiply and
Ensemble Multiply Sum instructions. As shown in FIGS.
28B and 28C, in this exemplary embodiment, the contents of
registers rc and rb are partitioned into groups of operands of
the size specified and convolved or divided or multiplied,
yielding a group of results, or multiplied and summed to a
single result. The group of results is catenated and placed, or
the single result is placed, in register rd.

Ensemble Floating-point Add, Divide, Multiply,
and Subtract

In accordance with one embodiment of the invention, the
processor also handles a variety of floating-point group operations
accommodating different operand sizes. Here, the
different operand sizes may represent floating-point operands
of different precisions, such as half-precision (16 bits),
single-precision (32 bits), double-precision (64 bits), and
quad-precision (128 bits). FIG. 29 illustrates exemplary
functions that are defined for use within the detailed instruction
definitions in other sections and figures. In the functions
set forth in FIG. 29, an internal format represents infinite-precision
floating-point values as a four-element structure
consisting of (1) s (sign bit): 0 for positive, 1 for negative,
(2) t (type): NORM, ZERO, SNAN, QNAN, INFINITY, (3)
e (exponent), and (4) f: (fraction). The mathematical interpretation
of a normal value places the binary point at the
units of the fraction, adjusted by the exponent: (-1)^s*(2^e)*f.
The function F converts a packed IEEE floating-point
value into internal format. The function PackF converts an
internal format back into IEEE floating-point format, with
rounding and exception control.

FIGS. 30A and 31A present various examples of
Ensemble Floating Point Add, Divide, Multiply, and Subtract
instructions. FIGS. 30B-C and 31B-C illustrate an
exemplary embodiment of formats and operation codes that
can be used to perform the various Ensemble Floating Point
Add, Divide, Multiply, and Subtract instructions. In these
examples, Ensemble Floating Point Add, Divide, and Multiply
instructions have been labeled as "Ensemble Floating-Point."
Also, Ensemble Floating-Point Subtract instructions
have been labeled as "Ensemble Reversed Floating-point." As
shown in FIGS. 30B-C and 31B-C, in this exemplary
embodiment, the contents of registers rc and rb are partitioned
into groups of operands of the size specified, and the
specified group operation is performed, yielding a group of
results. The group of results is catenated and placed in 
register rd. 
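
The partitioning that the group and ensemble operations share can be illustrated with the simplest integer case, a Group Add on fixed-size fields. The sketch below models registers as Python integers and wraps each field modulo its size; the function name is hypothetical, and the overflow-checked and limited variants mentioned in the text are not modeled.

```python
def group_add(rc, rb, size=8, width=128):
    """Add corresponding size-bit fields of two width-bit register values.

    Carries never cross a field boundary: each sum is truncated to
    size bits (assumed default when no overflow handling is specified).
    """
    mask = (1 << size) - 1
    result = 0
    for shift in range(0, width, size):
        a = (rc >> shift) & mask
        b = (rb >> shift) & mask
        result |= ((a + b) & mask) << shift
    return result


# Sixteen independent byte additions in one 128-bit operation.
x = int.from_bytes(bytes(range(16)), "little")
y = int.from_bytes(bytes([0xFF] * 16), "little")
print(group_add(x, y).to_bytes(16, "little").hex())
```

Changing size to 16, 32, 64 or 128 models the doublet, quadlet, octlet and hexlet forms with the same loop.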

In the present embodiment, the operation is rounded using
the specified rounding option or using round-to-nearest if
not specified. If a rounding option is specified, the operation
raises a floating-point exception if a floating-point invalid
operation, divide by zero, overflow, or underflow occurs, or
when specified, if the result is inexact. If a rounding option
is not specified, floating-point exceptions are not raised, and
are handled according to the default rules of IEEE 754.

Ensemble Scale-Add Floating-point

A novel instruction, Ensemble-Scale-Add, improves processor
performance by performing two sets of parallel
multiplications and pairwise summing the products. This
improves performance for operations in which two vectors
must be scaled by two independent values and then summed,
providing two advantages over nearest prior art operations
of a fused-multiply-add. To perform this operation using
prior art instructions, two instructions would be needed, an
ensemble-multiply for one vector and one scaling value, and
an ensemble-multiply-add for the second vector and second
scaling value, and these operations are clearly dependent. In
contrast, the present invention fuses both the two multiplies
and the addition for each pair of corresponding elements of the
vectors into a single operation. The first advantage achieved
is improved performance, as in an exemplary embodiment
the combined operation performs a greater number of multiplies
in a single operation, thus improving utilization of the
partitioned multiplier unit. The second advantage achieved
is improved accuracy, as an exemplary embodiment may
compute the fused operation with sufficient intermediate
precision so that no intermediate rounding of the products is
required.

An exemplary embodiment of the Ensemble Scale-Add
Floating-point instruction is shown in FIGS. 22A-22B. In an
exemplary embodiment, these operations take three values
from registers, perform a group of floating-point arithmetic
operations on partitions of bits in the operands, and place the
concatenated results in a register. An exemplary embodiment
of the format 2210 of the Ensemble Scale-Add
Floating-point instruction is shown in FIG. 22A.

In an exemplary embodiment, the contents of registers rd
and rc are taken to represent a group of floating-point
operands. Operands from register rd are multiplied with a
floating-point operand taken from the least-significant bits of
the contents of register rb and added to operands from
register rc multiplied with a floating-point operand taken
from the next least-significant bits of the contents of register
rb. The results are rounded to the nearest representable
floating-point value in a single floating-point operation.
Floating-point exceptions are not raised, and are handled
according to the default rules of IEEE 754. The results are
catenated and placed in register ra.
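
As an illustration of the data flow only, the operation can be sketched for single-precision operands. Registers are modeled as Python lists with element 0 holding the least-significant operand, and Python's double-precision arithmetic stands in for the wider intermediate precision; that substitution is an assumption of the sketch and does not guarantee the exact single rounding described above in every case.

```python
import struct


def to_single(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def ensemble_scale_add(rd, rc, rb):
    """Return [rd[i] * rb[0] + rc[i] * rb[1]] rounded once to single precision.

    rb[0] is the scale taken from the least-significant bits of rb,
    rb[1] the scale taken from the next least-significant bits.
    """
    scale_d, scale_c = rb[0], rb[1]
    return [to_single(d * scale_d + c * scale_c) for d, c in zip(rd, rc)]


print(ensemble_scale_add([1.0, 2.0], [3.0, 4.0], [0.5, 2.0]))
```

The two multiplies per element are independent, which is what lets a partitioned multiplier perform them in the same operation rather than in a dependent multiply followed by a multiply-add.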

An exemplary embodiment of the pseudocode 2230 of the 
Ensemble Scale-Add Floating-point instruction is shown in 
FIG. 22B. In an exemplary embodiment, there are no 
exceptions for the Ensemble Scale-Add Floating-point 
instruction. 

Performing a Three-input Bitwise Boolean 
Operation in a Single Instruction (Group Boolean) 

In a further aspect of the present invention, a system and
method is provided for performing a three-input bitwise
Boolean operation in a single instruction. A novel method is
used to encode the eight possible output states of such an
operation into only seven bits, and decoding these seven bits
back into the eight states.

An exemplary embodiment of the Group Boolean instruction
is shown in FIGS. 23A-23C. In an exemplary
embodiment, these operations take operands from three
registers, perform boolean operations on corresponding bits
in the operands, and place the concatenated results in the
third register. An exemplary embodiment of the format 2310
of the Group Boolean instruction is shown in FIG. 23A.

An exemplary embodiment of a procedure 2320 of Group
Boolean instruction is shown in FIG. 23B. In an exemplary
embodiment, three values are taken from the contents of
registers rd, rc and rb. The ih and il fields specify a function
of three bits, producing a single bit result. The specified
function is evaluated for each bit position, and the results are
catenated and placed in register rd. In an exemplary
embodiment, register rd is both a source and destination of
this instruction.

In an exemplary embodiment, the function is specified by
eight bits, which give the result for each possible value of the
three source bits in each bit position:

d          1   1   1   1   0   0   0   0
c          1   1   0   0   1   1   0   0
b          1   0   1   0   1   0   1   0
f(d,c,b)   f7  f6  f5  f4  f3  f2  f1  f0
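
Read as a lookup table, the eight function bits can be applied to every bit position independently. The following sketch assumes the bit ordering of the table above (index 4d + 2c + b selects the function bit) and models the registers as Python integers; the function name is hypothetical.

```python
def group_boolean(imm8, d, c, b, width=128):
    """Apply the three-input function coded by imm8 (f7..f0) bit by bit."""
    result = 0
    for i in range(width):
        index = ((d >> i) & 1) << 2 | ((c >> i) & 1) << 1 | ((b >> i) & 1)
        result |= ((imm8 >> index) & 1) << i
    return result


# 0b11001010 codes the bitwise select d ? c : b.
print(bin(group_boolean(0b11001010, d=0b1100, c=0b1010, b=0b0110, width=4)))
```

With this ordering, 0b10000000 is the three-way AND and 0b10010110 the three-way exclusive-or.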



In an exemplary embodiment, a function can be modified
by rearranging the bits of the immediate value. The table
below shows how rearrangement of immediate value f7...0
can reorder the operands d,c,b for the same function.

Operation   Immediate
f(d,c,b)    f7 f6 f5 f4 f3 f2 f1 f0
f(c,d,b)    f7 f6 f3 f2 f5 f4 f1 f0
f(d,b,c)    f7 f5 f6 f4 f3 f1 f2 f0
f(b,c,d)    f7 f3 f5 f1 f6 f2 f4 f0
f(c,b,d)    f7 f5 f3 f1 f6 f4 f2 f0
f(b,d,c)    f7 f3 f6 f2 f5 f1 f4 f0



In an exemplary embodiment, by using such a
rearrangement, an operation of the form: b=f(d,c,b) can be
recoded into a legal form: b=f(b,d,c). For example, the
function: b=f(d,c,b)=d?c:b cannot be coded, but the equivalent
function: d=c?b:d can be determined by rearranging the
code for d=f(d,c,b)=d?c:b, which is 11001010, according to
the rule for f(d,c,b) => f(c,b,d), to the code 11011000.
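
The effect of reordering operands on the immediate can be checked mechanically. The sketch below assumes that bit index 4d + 2c + b selects the function bit, with f7 as the most-significant bit of the code; both helper names are hypothetical.

```python
def truth_table(fn):
    """Build the 8-bit code f7..f0 of a three-input function fn(d, c, b)."""
    code = 0
    for index in range(8):
        d, c, b = (index >> 2) & 1, (index >> 1) & 1, index & 1
        code |= (fn(d, c, b) & 1) << index
    return code


def reorder(code, order):
    """Code of g(d,c,b) = f(order), e.g. order 'cbd' gives f(c,b,d)."""
    out = 0
    for index in range(8):
        env = {"d": (index >> 2) & 1, "c": (index >> 1) & 1, "b": index & 1}
        x, y, z = (env[name] for name in order)
        out |= ((code >> (4 * x + 2 * y + z)) & 1) << index
    return out


select = truth_table(lambda d, c, b: c if d else b)
for order in ("dcb", "dbc", "cbd", "bcd"):
    print(order, format(reorder(select, order), "08b"))
```

Evaluated under this bit ordering, d?c:b is 11001010, c?b:d comes out as 10111000, and b?c:d as 11011000, so the code quoted in the example appears to correspond to the f(b,c,d) ordering rather than f(c,b,d).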

Encoding 

In an exemplary embodiment, some special characteristics
of this rearrangement are the basis of the manner in which
the eight function specification bits are compressed to seven
immediate bits in this instruction. As seen in the table above,
in the general case, a rearrangement of operands from
f(d,c,b) to f(d,b,c) (interchanging rc and rb) requires interchanging
the values of f6 and f5 and the values of f2 and f1.
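
A short enumeration shows why this interchange leaves room to save one bit. The sketch sorts all 256 functions by how they respond to exchanging f6 with f5 and f2 with f1. The class labels and helper names are assumptions for illustration; the halving of each class relies on the order of the two register specifiers carrying one bit of information.

```python
from collections import Counter


def swap_rc_rb(code):
    """Exchange f6 with f5 and f2 with f1, as interchanging rc and rb requires."""
    f = [(code >> i) & 1 for i in range(8)]
    f[6], f[5] = f[5], f[6]
    f[2], f[1] = f[1], f[2]
    return sum(bit << i for i, bit in enumerate(f))


def classify(code):
    f = [(code >> i) & 1 for i in range(8)]
    if f[6] != f[5]:
        return "f6 != f5"
    if f[2] != f[1]:
        return "f6 == f5, f2 != f1"
    return "unchanged by the swap"


counts = Counter(classify(code) for code in range(256))
immediates = sum(n // 2 for n in counts.values())
print(dict(counts), immediates)
```

The three classes contain 64, 128 and 64 functions, and half of each (32 + 64 + 32 = 128) fits exactly in a seven-bit immediate.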

In an exemplary embodiment, among the 256 possible
functions which this instruction can perform, one quarter of
them (64 functions) are unchanged by this rearrangement.
These functions have the property that f6=f5 and f2=f1. The
values of rc and rb (Note that rc and rb are the register
specifiers, not the register contents) can be freely
interchanged, and so are sorted into rising or falling order to
indicate the value of f2. (A special case arises when rc=rb,
so the sorting of rc and rb cannot convey information.
However, as only the values f7, f4, f3, and f0 can ever result
in this case, f6, f5, f2, and f1 need not be coded for this case,
so no special handling is required.) These functions are
encoded by the values of f7, f6, f4, f3, and f0 in the immediate
field and f2 by whether rc>rb, thus using 32 immediate
values for 64 functions.

In an exemplary embodiment, another quarter of the
functions have f6=1 and f5=0. These functions are recoded
by interchanging rc and rb, f6 and f5, f2 and f1. They then
share the same encoding as the quarter of the functions
where f6=0 and f5=1, and are encoded by the values of f7, f4,
f3, f2, f1, and f0 in the immediate field, thus using 64
immediate values for 128 functions.

In an exemplary embodiment, the remaining quarter of
the functions have f6=f5 and f2≠f1. The half of these in which
f2=1 and f1=0 are recoded by interchanging rc and rb, f6 and
f5, f2 and f1. They then share the same encoding as the eighth
of the functions where f2=0 and f1=1, and are encoded by the
values of f7, f6, f4, f3, and f0 in the immediate field, thus
using 32 immediate values for 64 functions.

In an exemplary embodiment, the function encoding is
summarized by the table:

[table omitted in source]

In an exemplary embodiment, the function decoding is
summarized by the table:

[table omitted in source]

From the foregoing discussion, it can be appreciated that
an exemplary embodiment of a compiler or assembler
producing the encoded instruction performs the steps above
to encode the instruction, comparing the f6 and f5 values and
the f2 and f1 values of the immediate field to determine
which one of several means of encoding the immediate field
is to be employed, and that the placement of the trb and trc
register specifiers into the encoded instruction depends on
the values of f2 (or f1) and f6 (or f5).

An exemplary embodiment of the pseudocode 2330 of the
Group Boolean instruction is shown in FIG. 23C. It can be
appreciated from the code that an exemplary embodiment of
a circuit that decodes this instruction produces the f2 and f1
values, when the immediate bits ih and il5 are zero, by an
arithmetic comparison of the register specifiers rc and rb,
producing a one (1) value for f2 and f1 when rc>rb. In an
exemplary embodiment, there are no exceptions for the
Group Boolean instruction.

Improving the Branch Prediction of Simple
Repetitive Loops of Code

In yet a further aspect of the present invention, a system
and method is described for improving the branch prediction
of simple repetitive loops of code. In such a simple loop, the
end of the loop is indicated by a conditional branch backward
to the beginning of the loop. The conditional branch of
such a loop is taken for each iteration of the loop except the
final iteration, when it is not taken. Prior art branch prediction
systems have employed finite state machine operations
to attempt to properly predict a majority of such conditional
branches, but without specific information as to the number
of times the loop iterates, will make an error in prediction
when the loop terminates.

The system and method of the present invention includes
providing a count field for indicating how many times a
branch is likely to be taken before it is not taken, which
enhances the ability to properly predict both the initial and
final branches of simple loops when a compiler can determine
the number of iterations that the loop will be performed.
This improves performance by avoiding misprediction
of the branch at the end of a loop when the loop
terminates and instruction execution is to continue beyond
the loop, as occurs in prior art branch prediction hardware.



Branch Hint 



An exemplary embodiment of the Branch Hint instruction 
is shown in FIGS. 24A-24C. In an exemplary embodiment, 
this operation indicates a future branch location specified by 
a register. 

In an exemplary embodiment, this instruction directs the 
instruction fetch unit of the processor that a branch is likely 
to occur count times at simm instructions following the 
current successor instruction to the address specified by the 
contents of register rd. An exemplary embodiment of the 
format 2410 of the Branch Hint instruction is shown in FIG. 
24A. 

In an exemplary embodiment, after branching count 
times, the instruction fetch unit presumes that the branch at 
simm instructions following the current successor 
instruction is not likely to occur. If count is zero, this hint directs 
the instruction fetch unit that the branch is likely to occur 
more than 63 times. 

In an exemplary embodiment, an Access disallowed 
exception occurs if the contents of register rd is not aligned 
on a quadlet boundary. 

An exemplary embodiment of the pseudocode 2430 of the 
Branch Hint instruction is shown in FIG. 24B. An exemplary 
embodiment of the exceptions 2460 of the Branch Hint 
instruction is shown in FIG. 24C. 



Incorporating Floating Point Information Into 
Processor Instructions 



In a still further aspect of the present invention, a 
technique is provided for incorporating floating point 
information into processor instructions. In related U.S. Pat. No. 
5,812,439, a system and method are described for 
incorporating control of rounding and exceptions for floating-point 
instructions into the instruction itself. The present invention 
extends this invention to include separate instructions in 
which rounding is specified, but default handling of 
exceptions is also specified, for a particular class of floating-point 
instructions. 



Ensemble Sink Floating-point 



In an exemplary embodiment, an Ensemble Sink 
Floating-point instruction, which converts floating-point values to 
integral values, is available with control in the instruction 
that includes all previously specified combinations 
(default-near rounding and default exceptions, Z: round toward 
zero and trap on exceptions, N: round to nearest and trap on 
exceptions, F: floor rounding (toward minus infinity) and 
trap on exceptions, C: ceiling rounding (toward plus 
infinity) and trap on exceptions, and X: trap on inexact and 
other exceptions), as well as three new combinations (Z.D: 
round toward zero and default exception handling, F.D: 
floor rounding and default exception handling, and C.D: 
ceiling rounding and default exception handling). (The other 
combinations: N.D is equivalent to the default, and X.D, 
trap on inexact but default handling for other exceptions, is 
possible but not particularly valuable). 

An exemplary embodiment of the Ensemble Sink 
Floating-point instruction is shown in FIGS. 25A-25C. In an 
exemplary embodiment, these operations take one value 
from a register, perform a group of floating-point arithmetic 
conversions to integer on partitions of bits in the operands, 
and place the concatenated results in a register. An 
exemplary embodiment of the operation codes, selection, and 
format 2510 of Ensemble Sink Floating-point instruction is 
shown in FIG. 25A. 

In an exemplary embodiment, the contents of register rc 
is partitioned into floating-point operands of the precision 
specified and converted to integer values. The results are 
catenated and placed in register rd. 

In an exemplary embodiment, the operation is rounded 
using the specified rounding option or using 
round-to-nearest if not specified. If a rounding option is specified, 
unless default exception handling is specified, the operation 
raises a floating-point exception if a floating-point invalid 
operation, divide by zero, overflow, or underflow occurs, or 
when specified, if the result is inexact. If a rounding option 
is not specified or if default exception handling is specified, 
floating-point exceptions are not raised, and are handled 
according to the default rules of IEEE 754. 

An exemplary embodiment of the pseudocode 2530 of the 
Ensemble Sink Floating-point instruction is shown in FIG. 
25B. An exemplary embodiment of the exceptions 2560 of 
the Ensemble Sink Floating-point instruction is shown in 
FIG. 25C. 

An exemplary embodiment of the pseudocode 2570 of the 
Floating-point instructions is shown in FIG. 25D. 



Crossbar Compress, Expand, Rotate, and Shift 



In one embodiment of the invention, crossbar switch units 
such as units 142 and 148 perform data handling operations, 
as previously discussed. As shown in FIG. 32A, such data 
handling operations may include various examples of 
Crossbar Compress, Crossbar Expand, Crossbar Rotate, and 
Crossbar Shift operations. FIGS. 32B and 32C illustrate an 
exemplary embodiment of a format and operation codes that 
can be used to perform the various Crossbar Compress, 
Crossbar Rotate, Crossbar Expand, and Crossbar Shift 
instructions. As shown in FIGS. 32B and 32C, in this 
exemplary embodiment, the contents of register rc are 
partitioned into groups of operands of the size specified, and 
compressed, expanded, rotated or shifted by an amount 
specified by a portion of the contents of register rb, yielding 
a group of results. The group of results is catenated and 
placed in register rd. 

Various Group Compress operations may convert groups 
of operands from higher precision data to lower precision 
data. An arbitrary half-sized sub-field of each bit field can be 
selected to appear in the result. For example, FIG. 32D 
shows an X.COMPRESS rd=rc,16,4 operation, which 
performs a selection of bits 19 ... 4 of each quadlet in a hexlet. 
Various Group Shift operations may allow shifting of groups 
of operands by a specified number of bits, in a specified 
direction, such as shift right or shift left. As can be seen in 
FIG. 32C, certain Group Shift Left instructions may also 
involve clearing (to zero) empty low order bits associated 
with the shift, for each operand. Certain Group Shift Right 
instructions may involve clearing (to zero) empty high order 
bits associated with the shift, for each operand. Further, 
certain Group Shift Right instructions may involve filling 
empty high order bits associated with the shift with copies 
of the sign bit, for each operand. 



Extract 



In one embodiment of the invention, data handling 
operations may also include a Crossbar Extract instruction. FIGS. 
33A and 33B illustrate an exemplary embodiment of a 
format and operation codes that can be used to perform the 
Crossbar Extract instruction. As shown in FIGS. 33A and 
33B, in this exemplary embodiment, the contents of registers 
rd, rc, and rb are fetched. The specified operation is 
performed on these operands. The result is placed into register 
ra. 

The Crossbar Extract instruction allows bits to be 
extracted from different operands in various ways. 
Specifically, bits 31 ... 0 of the contents of register rb 
specifies several parameters which control the manner in 
which data is extracted, and for certain operations, the 
manner in which the operation is performed. The position of 
the control fields allows for the source position to be added 
to a fixed control value for dynamic computation, and allows 
for the lower 16 bits of the control field to be set for some 
of the simpler extract cases by a single GCOPYI.128 
instruction (see appendix). The control fields are further 
arranged so that if only the low order 8 bits are non-zero, a 
128-bit extraction with truncation and no rounding is 
performed: 

 31    24 23    16 15  14  13  12  11  10  9 8      0 
|  fsize |  dpos  | x | s | n | m | l |  rnd |  gssp  | 
     8        8     1   1   1   1   1     2      9 

The table below describes the meaning of each label: 

label    bits    meaning 
fsize    8       field size 
dpos     8       destination position 
x        1       reserved 
s        1       signed vs. unsigned 
n        1       reserved 
m        1       merge vs. extract 
l        1       reserved 
rnd      2       reserved 
gssp     9       group size and source position 

The 9-bit gssp field encodes both the group size, gsize, 
and source position, spos, according to the formula 
gssp=512-4*gsize+spos. The group size, gsize, is a power of two 
in the range 1 ... 128. The source position, spos, is in the 
range 0 ... (2*gsize)-1. 

The values in the s, n, m, l, and rnd fields have the 
following meaning: 

values   s          n    m          l    rnd 
0        unsigned        extract 
1        signed          merge 
2 
3 

As shown in FIG. 33C, for the X.EXTRACT instruction, 
when m=0, the parameters are interpreted to select fields 
from the catenated contents of registers rd and rc, extracting 
values which are catenated and placed in register ra. As 
shown in FIG. 33D, for a crossbar-merge-extract 
(X.EXTRACT when m=1), the parameters are interpreted to 
merge fields from the contents of register rd with the 
contents of register rc. The results are catenated and placed 
in register ra. 



Shuffle 



As shown in FIG. 34A, in one embodiment of the 
invention, data handling operations may also include various 
Shuffle instructions, which allow the contents of registers to 
be partitioned into groups of operands and interleaved in a 
variety of ways. FIGS. 34B and 34C illustrate an exemplary 
embodiment of a format and operation codes that can be 
used to perform the various Shuffle instructions. As shown 
in FIGS. 34B and 34C, in this exemplary embodiment, one 
of two operations is performed, depending on whether the rc 
and rb fields are equal. Also, FIG. 34B and the description 
below illustrate the format of and relationship of the rd, rc, 
rb, op, v, w, h, and size fields. 

In the present embodiment, if the rc and rb fields are 
equal, a 128-bit operand is taken from the contents of 
register rc. Items of size v are divided into w piles and 
shuffled together, within groups of size bits, according to the 
value of op. The result is placed in register rd. 

Further, if the rc and rb fields are not equal, the contents 
of registers rc and rb are catenated into a 256-bit operand. 
Items of size v are divided into w piles and shuffled together, 
according to the value of op. Depending on the value of h, 
a sub-field of op, the low 128 bits (h=0), or the high 128 bits 
(h=1) of the 256-bit shuffled contents are selected as the 
result. The result is placed in register rd. 

As shown in FIG. 34D, an example of a crossbar 4-way 
shuffle of bytes within hexlet instruction (X.SHUFFLE.128 
rd=rcb,8,4) may divide the 128-bit operand into 16 bytes and 
partitions the bytes 4 ways (indicated by varying shade in the 
diagram below). The 4 partitions are perfectly shuffled, 
producing a 128-bit result. As shown in FIG. 34E, an 
example of a crossbar 4-way shuffle of bytes within triclet 
instruction (X.SHUFFLE.256 rd=rc,rb,8,4,0) may catenate 
the contents of rc and rb, then divides the 256-bit content 
into 32 bytes and partitions the bytes 4 ways (indicated by 
varying shade in the diagram below). The low-order halves 
of the 4 partitions are perfectly shuffled, producing a 128-bit 
result. 

Changing the last immediate value h to 1 
(X.SHUFFLE.256 rd=rc,rb,8,4,1) may modify the operation 
to perform the same function on the high-order halves of the 
4 partitions. When rc and rb are equal, the table below shows 
the value of the op field and associated values for size, v, and 
w: 



op    size    v     w 
0     4       1     2 
1     8       1     2 
2     8       2     2 
3     8       1     4 
4     16      1     2 
5     16      2     2 
6     16      4     2 
7     16      1     4 
8     16      2     4 
9     16      1     8 
10    32      1     2 
11    32      2     2 
12    32      4     2 
13    32      8     2 
14    32      1     4 
15    32      2     4 
16    32      4     4 
17    32      1     8 
18    32      2     8 
19    32      1     16 
20    64      1     2 
21    64      2     2 
22    64      4     2 
23    64      8     2 
24    64      16    2 
25    64      1     4 
26    64      2     4 
27    64      4     4 
28    64      8     4 
29    64      1     8 
30    64      2     8 
31    64      4     8 
32    64      1     16 
33    64      2     16 
34    64      1     32 
35    128     1     2 
36    128     2     2 
37    128     4     2 
38    128     8     2 
39    128     16    2 
40    128     32    2 
41    128     1     4 
42    128     2     4 
43    128     4     4 
44    128     8     4 
45    128     16    4 
46    128     1     8 
47    128     2     8 
48    128     4     8 
49    128     8     8 
50    128     1     16 
51    128     2     16 
52    128     4     16 
53    128     1     32 
54    128     2     32 
55    128     1     64 
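
As an illustrative sketch only, the regular structure of the table above and the 128-bit shuffle it selects can be modeled in Python. All function names are hypothetical. The enumeration order (size, then w, then v, with v*w no greater than half of size) is inferred from the table. Numbering items from the low-order end, and interleaving the piles so that result item i*w+p is item i of pile p, are assumptions about the perfect shuffle rather than details taken from the pseudocode of the figures.

```python
def shuffle_op_table(sizes=(4, 8, 16, 32, 64, 128)):
    """Enumerate (size, v, w) triples; the list index is the op value."""
    table = []
    for size in sizes:
        w = 2
        while w <= size // 2:
            v = 1
            while v * w <= size // 2:
                table.append((size, v, w))
                v *= 2
            w *= 2
    return table


def perfect_shuffle(items, w):
    """Split items into w consecutive piles and interleave them."""
    pile_len = len(items) // w
    piles = [items[p * pile_len:(p + 1) * pile_len] for p in range(w)]
    return [piles[p][i] for i in range(pile_len) for p in range(w)]


def x_shuffle_128(value, size, v, w):
    """Shuffle v-bit items into w piles within each size-bit group
    of a 128-bit operand held in a Python int."""
    item_mask = (1 << v) - 1
    result = 0
    for group in range(128 // size):
        base = group * size
        # Item 0 is taken from the low-order end of the group (assumption).
        items = [(value >> (base + k * v)) & item_mask
                 for k in range(size // v)]
        for k, item in enumerate(perfect_shuffle(items, w)):
            result |= item << (base + k * v)
    return result


if __name__ == "__main__":
    table = shuffle_op_table()
    print(len(table), table[0], table[55])
    # Byte k of the operand holds the value k, so the result shows
    # where each source byte lands for size=128, v=8, w=4.
    operand = sum(k << (8 * k) for k in range(16))
    shuffled = x_shuffle_128(operand, 128, 8, 4)
    print([(shuffled >> (8 * k)) & 0xFF for k in range(16)])
```

Generating the table from a rule rather than storing it is a design choice for the sketch; a decoder could equally use a lookup indexed by op.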



When rc and rb are not equal, the table below shows the 
value of the op4..0 field and associated values for size, v, and 
w. op5 is the value of h, which controls whether the 
low-order or high-order half of each partition is shuffled into 
the result. 



op4..0    size    v     w 
0         256     1     2 
1         256     2     2 
2         256     4     2 
3         256     8     2 
4         256     16    2 
5         256     32    2 
6         256     64    2 
7         256     1     4 
8         256     2     4 
9         256     4     4 
10        256     8     4 
11        256     16    4 
12        256     32    4 
13        256     1     8 
14        256     2     8 
15        256     4     8 
16        256     8     8 
17        256     16    8 
18        256     1     16 
19        256     2     16 
20        256     4     16 
21        256     8     16 
22        256     1     32 
23        256     2     32 
24        256     4     32 
25        256     1     64 
26        256     2     64 
27        256     1     128 
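
As an illustrative sketch only, the 256-bit form of the shuffle and the selection of a 128-bit half by h can be modeled in Python. The function names are hypothetical. Placing the contents of rc in the low-order half of the catenated operand, numbering items from the low-order end, and the pile interleaving order are all assumptions made for the sketch.

```python
MASK128 = (1 << 128) - 1


def shuffle_256_op_table():
    """Enumerate (size, v, w) triples for size 256; index is op4..0."""
    table = []
    w = 2
    while w <= 128:
        v = 1
        while v * w <= 128:
            table.append((256, v, w))
            v *= 2
        w *= 2
    return table


def perfect_shuffle(items, w):
    """Split items into w consecutive piles and interleave them."""
    pile_len = len(items) // w
    piles = [items[p * pile_len:(p + 1) * pile_len] for p in range(w)]
    return [piles[p][i] for i in range(pile_len) for p in range(w)]


def x_shuffle_256(rc, rb, v, w, h):
    """Catenate two 128-bit values, shuffle v-bit items into w piles,
    and return the low (h=0) or high (h=1) 128 bits of the result."""
    operand = (rb << 128) | rc  # assumption: rc is the low-order half
    item_mask = (1 << v) - 1
    items = [(operand >> (k * v)) & item_mask for k in range(256 // v)]
    wide = 0
    for k, item in enumerate(perfect_shuffle(items, w)):
        wide |= item << (k * v)
    return (wide >> (128 * h)) & MASK128


def to_bytes_list(value):
    """Helper for display: the 16 bytes of a 128-bit value, low first."""
    return [(value >> (8 * k)) & 0xFF for k in range(16)]


if __name__ == "__main__":
    print(len(shuffle_256_op_table()))
    rc = sum(k << (8 * k) for k in range(16))         # bytes 0..15
    rb = sum((16 + k) << (8 * k) for k in range(16))  # bytes 16..31
    print(to_bytes_list(x_shuffle_256(rc, rb, 8, 4, 0)))
    print(to_bytes_list(x_shuffle_256(rc, rb, 8, 4, 1)))
```

With h=0 the result draws only on the low-order half of each of the 4 piles, and with h=1 only on the high-order half, which matches the description of the X.SHUFFLE.256 examples.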



Conclusion 

Having fully described a preferred embodiment of the 
invention and various alternatives, those skilled in the art 
will recognize, given the teachings herein, that numerous 
alternatives and equivalents exist which do not depart from 
the invention. It is therefore intended that the invention not 
be limited by the foregoing description, but only by the 
appended claims. 

We claim: 

1. In a system having a data path functional unit having 
a functional unit data path width, a first memory system 
having a first data path width, and a second memory system 
having a data path width which is greater than the functional 
unit data path width and greater than the first data path 
width, a method comprising: 

copying a first memory operand portion from the first 
memory system to the second memory system, the first 
memory operand portion having the first data path 
width; and 

copying a second memory operand portion from the first 
memory system to the second memory system, the 
second memory operand portion having the first data 
path width and being catenated in the second memory 
system with the first memory operand portion, thereby 
forming catenated data. 

2. The method of claim 1 further comprising reading at 
least a portion of the catenated data which is greater in width 
than the first data path width. 

3. The method of claim 2 further comprising specifying a 
memory specifier from which a plurality of data path widths 
of data can be read. 

4. The method of claim 3 wherein the memory specifier 
comprises: 

a memory address; 

a memory size; and 

a memory shape. 

5. The method of claim 2 further comprising checking the 
validity of the first memory operand portion and, if valid, 
permitting a subsequent instruction to access the first 
memory operand portion. 

6. The method of claim 2 further comprising checking the 
validity of the second memory operand portion and, if valid, 
permitting a subsequent instruction to access the second 
memory operand portion. 

7. In a system having a data path functional unit having 
a functional unit data path width, a first memory system 
having a first data path width, and a second memory system 
having a data path width which is greater than the functional 
unit data path width and greater than the first data path 
width, a method comprising: 

copying a first memory operand portion from the first 
memory system to the second memory system, the first 
memory operand portion having the first data path 
width; 

copying a second memory operand portion from the first 
memory system to the second memory system, the 
second memory operand portion having the first data 
path width; and 

catenating the second memory operand portion in the 
second memory system with the first memory operand 
portion, thereby forming catenated data. 

8. The method of claim 7 further comprising reading at 
least a portion of the catenated data which is greater in width 
than the first data path width. 

9. The method of claim 8 further comprising specifying a 
memory specifier from which a plurality of data path widths 
of data can be read. 

10. The method of claim 9 wherein the memory specifier 
comprises: 

a memory address; 

a memory size; and 

a memory shape. 

11. The method of claim 8 further comprising checking 
the validity of the first memory operand portion and, if valid, 
permitting a subsequent instruction to access the first 
memory operand portion. 

12. The method of claim 8 further comprising checking 
the validity of the second memory operand portion and, if 
valid, permitting a subsequent instruction to access the 
second memory operand portion. 

13. In a system having a data path functional unit having 
a functional unit data path width, a first memory system 
having a first data path width, and a second memory system 
having a data path width which is greater than the functional 
unit data path width and greater than the first data path 
width, a system comprising: 

a first copying module configured to copy a first memory 
operand portion from the first memory system to the 
second memory system, the first memory operand 
portion having the first data path width; and 

a second copying module configured to copy a second 
memory operand portion from the first memory system 
to the second memory system, the second memory 
operand portion having the first data path width and 
being catenated in the second memory system with the 
first memory operand portion, thereby forming 
catenated data. 

14. The system of claim 13 further comprising a reading 
module configured to read at least a portion of the catenated 
data which is greater in width than the first data path width. 

15. In a system having a data path functional unit having 
a functional unit data path width, a first memory system 
having a first data path width, and a second memory system 
having a data path width which is greater than the functional 
unit data path width and greater than the first data path 
width, a system comprising: 

a first copying module configured to copy a first memory 
operand portion from the first memory system to the 
second memory system, the first memory operand 
portion having the first data path width; and 

a second copying module configured to copy a second 
memory operand portion from the first memory system 
to the second memory system, the second memory 
operand portion having the first data path width. 

16. The system of claim 15 further comprising a 
catenating module configured to catenate in the second memory 
system the second memory operand portion with the first 
memory operand portion, thereby forming catenated data. 

17. The system of claim 16 further comprising a reading 
module configured to read at least a portion of the catenated 
data which is greater in width than the first data path width. 

18. A method of processing a data stream in a general 
purpose processor capable of operation independent of 
another host processor, the general purpose processor having 
a virtual memory addressing unit, an instruction path and a 
data path to digitally process the data stream, the method 
comprising: 

receiving the data stream over the data path; 

dynamically partitioning the data stream based on an 
elemental width of the data and storing partitioned data 
in registers of a register file coupled to the data path, 
wherein a number of data elements stored in a register 
is inversely related to the elemental width of the data 
stored in partitioned fields of the register; 

performing group floating point operations on multiple 
operands stored in partitioned fields of registers and, 
for each group floating point operation, returning 
catenated results of the operation to a register. 

19. The method of claim 18 wherein the data stream 
comprises media data. 

20. The method of claim 19 wherein the media data 
comprises broadband communications data. 

21. The method of claim 19 wherein the media data 
comprises audio data. 

22. The method of claim 19 wherein the media data 
comprises image data. 

23. The method of claim 19 wherein the media data 
comprises video data. 

24. The method of claim 19 wherein the media data 
comprises compressed data. 

25. The method of claim 19 wherein the media data 
comprises error checking data. 

26. The method of claim 19 wherein the media data 
comprises error correction data. 

27. The method of claim 18 wherein, for a specific group 
floating point operation performed, the catenated results of 
the specific operation are returned to a register that is 
different than the registers used to store the multiple 
operands for the specific operation. 

28. The method of claim 18 wherein the performing step 
comprises performing group add, group subtract and group 
multiply arithmetic operations on catenated floating-point 
data and, for each such group operation, returning catenated 
results of the operation to a register. 

29. The method of claim 18 wherein the performing group 
floating-point operations comprises operating, in parallel, on 
multiple operands stored in partitioned fields of registers. 

30. The method of claim 18 wherein the performing group 
floating-point operations comprises performing a first group 
floating-point operation on floating-point data of a first 
precision and performing a second group floating point 
operation on floating-point data of a second precision that is 
a lower precision than the first precision by operating, in 
parallel, on at least two operands stored in partitioned fields 
of registers. 

31. The method of claim 18 further comprising 
performing group integer operations on multiple operands stored in 
partitioned fields of registers and, for each group integer 
operation, returning catenated results of the operation to a 
register. 

32. The method of claim 31 wherein, for a specific group 
integer operation performed, the catenated results of the 
specific operation are returned to a register that is different 
than the registers used to store the multiple operands for the 
specific operation. 

33. The method of claim 31 wherein the performing group 
integer operations comprises performing group add, group 
subtract and group multiply arithmetic operations on 
catenated integer data and, for each such group operation, 
returning catenated results of the operation to a register. 

34. The method of claim 31 wherein the performing group 
integer operations comprises operating, in parallel, on 
multiple operands stored in partitioned fields of registers. 

35. The method of claim 31 wherein the performing group 
integer operations comprises performing a first group integer 
operation on integer data of a first precision and performing 
a second group integer operation on integer data of a second 
precision that is a lower precision than the first precision by 
operating, in parallel, on at least two operands stored in 
partitioned fields of registers. 

36. The method of claim 18 further comprising 
performing one or more group data handling operations that operate 
on multiple operands stored in partitioned fields of operand 
registers and, for each group data handling operation, 
returning catenated results of the operation to a register. 

37. The method of claim 36 wherein the performing one 
or more group data handling operations comprises 
converting a plurality of n-bit data elements in a first operand 
register and a plurality of n-bit data elements in a second 
operand register into a plurality of n/2-bit data elements. 

38. The media processor of claim 37 wherein the 
converting step shifts each of the plurality of n/2-bit data 
elements by a specified number of bits during the 
conversion. 

39. The method of claim 36 wherein the performing one 
or more group data handling operations comprises 
interleaving a plurality of data elements selected from a first operand 
register with a plurality of data elements selected from a 
second operand register and catenating the data elements 
into a result register. 

40. The method of claim 36 wherein the performing one 
or more data handling operations comprises shifting bits of 
individual data elements catenated in an operand register to 
the left and clearing empty low order bits of the individual 
data elements to zero. 

41. The method of claim 36 wherein the performing one 
or more data handling operations comprises shifting bits of 
individual data elements catenated in an operand register to 
the right and filling empty high order bits of the individual 
data elements with a value equal to a value stored in a sign 
bit of the individual data element. 

42. The method of claim 36 wherein the performing one 
or more data handling operations comprises shifting bits of 
individual data elements catenated in an operand register to 
the right and clearing empty high order bits of the individual 
data elements to zero. 

43. The method of claim 36 wherein the performing one 
or more group data handling operations comprises 
operating, in parallel, on multiple operands stored in parti- 
tioned fields of registers. 

44. The method of claim 36 wherein the performing one 
or more group data handling operations comprises 
performing a first group data handling operation on data of a first 
precision and performing a second group data handling 
operation on data of a second precision that is a lower 
precision than the first precision by operating, in parallel, on 
at least two operands stored in partitioned fields of registers. 

45. The method of claim 18 wherein the group floating 
point operations are associated with a plurality of instruction 
streams from a plurality of threads executing in parallel on 
the processor. 

46. The method of claim 18 wherein the performing step 
comprises performing group floating-point operations on 
data having a total aggregate width of 128 bits. 

47. The method of claim 18 wherein the performing step 
comprises performing group floating-point operations on 
data of more than one precision. 

48. The method of claim 18 further comprising storing 
floating-point data in a register file in a format conforming 
to IEEE standard 754. 




